Motivation

In collaboration with Juthi Dewan, Sam Ding, and Vichy Meas, we designed this project for our Bayesian Statistics course taught by Dr. Alicia Johnson. We’d like to extend our thanks to Alicia for guiding us through Bayes and the capstone experience!

A reproducible version of this blog post with all code can be found here.

We were initially interested in characterizing New York City’s internal racial dynamics using demography, geographic mobility, community health, and economic outcomes. As this project developed, we found ourselves thinking about the relationships between transportation (in)access and housing inequity. In our project, there are two major sections: Subway Accessibility and Transportation and Structural Inequity. In Subway Accessibility we explore transportation deserts and what are the major determinants of Subway Inaccess in New York City using two Bayesian classification models. While in Transportation and Structural Inequity, we extend our discussion of transportation access to study its relationship to both rental prices and evictions using Bayesian multivariable regression.

First, however, let’s do a data introduction:

Data Introduction

# Load packages
library(tidyverse)
library(janitor)
library(here)
library(rstan)
library(rstanarm)
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(bayesplot)
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(forcats)
library(bayesplot)
library(sf)
library(nycgeo)
library(tidycensus)
library(extrafont)
library(extrafontdb)
# themes
theme_set(theme_minimal())
brown_green <- c("#E9DBC2","#7D9B8A","#4D6F5C","#D29B5B","#744410","#1C432D")
color_scheme_set(brown_green)
nyc_join <- merge(nta_sf,nta_acs_data)


nyc_join <- nyc_join %>%
  st_transform(., 4269)

county_list <- nyc_join %>% pull(county_name) %>% unique()


census_api_key("0cc07f06386e317f312adef5e0892b0d002b7254")

census_data <- get_acs(state = "NY", 
        county = c(county_list), 
        geography = "tract", 
        variables = c(gini_inequality ="B19083_001"),
        year = 2019,
        output = "wide",
        survey = "acs5",
        geometry = TRUE) %>% 
  dplyr::select(-c(NAME, ends_with("M"))) %>%
         rename_at(vars(ends_with("E")), .funs = list(~str_sub(., end = -2)))  %>%
  st_transform(., 4269) %>%
  dplyr::select(-GEOID)
vari_names <- read_csv(here("clean_data", "nyc_names_vichy_complied2.csv"))
nyc_clean <- st_read(here("clean_data", "nyc_data_vichy_complied2.shp"))
colnames(nyc_clean) <- colnames(vari_names)

gini_neighborhood <- st_join(nyc_clean, census_data, left = TRUE) %>%
  group_by(nta_id) %>%
  summarize(gini_neighborhood=median(gini_inequality, na.rm=T)) %>%
  as.tibble() %>%
  dplyr::select(nta_id, gini_neighborhood) 

nyc_clean <- nyc_clean %>%
  as.tibble() %>%
  left_join(., gini_neighborhood, by="nta_id")%>% 
  unique() %>%
  st_as_sf()

nyc_compiled <- nyc_clean %>%
   mutate(asian_perc = asian_count / total_pop) %>%
   mutate(white_perc = white_count / total_pop) %>%
   mutate(black_perc = black_count / total_pop) %>%
   mutate(latinx_perc = latinx_count / total_pop) %>%
   mutate(native_perc = native_count / total_pop) %>%
   mutate(noncitizen_perc = noncitizen_count / total_pop) %>%
   mutate(evictions_perc = eviction_count / total_pop) %>%
   mutate(uninsured_perc = uninsured_count / total_pop) %>%
  mutate(unemployment_perc = unemployment_count / total_pop) %>%
  mutate(below_poverty_line_perc = below_poverty_line_count / total_pop) %>%
  mutate(transportation_desert_4cat = 
           factor(transportation_desert_4cat, levels=c("Poor", "Limited", "Satisfactory", "Excellent")))

All data used in this project are from two major sources: the Tidycensus package and NYC Open Data.

Tidycensus is an R package interface, developed by Kyle Walker and Matt Herman, that enables easy access to the US Census Bureau’s data APIs and returns Tidyverse-ready data frames from various major US Census Bureau datasets. Our demographic and socioeconomic data are drawn from the 2019 American Community Survey results found in Tidycensus package. A summary of our ACS data variables is below:

  • borough: NYC Borough
  • total_pop: Total Population Count by Census Tract
  • mean_income: Mean Income by Census Tract
  • below_poverty_line_count: Number of People Living Below the 100% Poverty Line by Census Tract
  • mean_rent: Mean Rent by Census
  • unemployment_count: Number of People on Unemployment by Census Tract
  • latinx_count: Number of Latinx People by Census Tract
  • white_count: Number of White People by Census Tract
  • black_count: Number of Black People by Census Tract
  • native_count: Number of Native People by Census Tract
  • asian_count: Number of Asian People by Census Tract
  • naturalized_citizen_count: Number of Naturalized Citizens by Census Tract
  • noncitizen_count: Number of Foreign Born People by Census Tract
  • uninsured_count: Number of Uninsured Citizens of any Age by Census Tract
  • gini_neighborhood : Income inequality measured by the Gini Index per Census Tract

Note that these predictors are all measured at the census level. To aggregate these estimates at the neighborhood level, we performed two transformations:

  • Compute total sums by neighborhood for every count-based measurement.
  • Compute mean average estimates by neighborhood for remaining non-count predictors.

Then, for count based demographic predictors, we divide by the total population in each neighborhood to define scaled demographic metrics. They are as follows:

  • asian_perc: Percentage of Asian People by Neighborhood
  • white_perc: Percentage of White People by Neighborhood
  • black_perc: Percentage of Black People by Neighborhood
  • latinx_perc: Percentage of Latinx People by Neighborhood
  • native_perc: Percentage of Native People by Neighborhood
  • noncitizen_perc: Percentage of Foreign Born People by Neighborhood
  • uninsured_perc: Percentage of Uninsured Citizens of any Age by Neighborhood
  • unemployment_perc: Percentage of People on Unemployment by Neighborhood
  • below_poverty_line_perc: Percentage of people living below the 100% poverty line.

We used NYC’s Open Data portal and the Baruch College GIS Lab to collect information on the remaining predictors. In particular, we used geotagged locations of Subway Stops, Bus Stops, Grocery Stores, Schools, and Evictions from the Departments of Transportation, Health, Education, and Housing, respectively, to calculate the following variables:

  • school_count: Public school counts by Neighborhood
  • eviction_count: Eviction counts by Neighborhood
  • store_count: Grocery store and food vendor counts by Neighborhood
  • sub_count: Subway station counts by Neighborhood
  • bus_count: Bus station counts by Neighborhood
  • perc_covered_by_transit: Percent of Neighborhood Within Walking Distance (.5 miles) of Any Subway Stop.
  • transportation_desert_4cat: Subway Accessibility by Neighborhood (Poor, Limited, Satisfactory, Excellent)

The process involved grouping geotagged locations by the defined neighborhood boundary regions in R’s SF package and ArcGIS. We will detail the process of identifying subway deserts in the “Subway Deserts” section.

Data Summaries

We present a numeric summary on our data with 224 observations of 38 variables below. However, note that we use percent equivalents for most demographic count variables.

#summary(modeling_data)
library(table1)
table_print <- table1(~ 
                        total_pop + mean_income + mean_rent +
                        gini_neighborhood + 
                        below_poverty_line_perc + 
                        eviction_count + 
                        transportation_desert_4cat + 
                        noncitizen_perc + white_perc + black_perc +
                        latinx_perc + asian_perc
                        | borough, data = nyc_compiled %>% as_tibble())  %>% as_tibble() 

colnames(table_print) <- c("Variable", "Bronx (N=44)", "Brooklyn (N=64)","Manhattan (N=39)", "Queens (N=77)", "Overall (N=224)")

library(kableExtra)
table_print%>%
  filter(Variable!="") %>% kable() %>% kable_styling() %>% scroll_box(height ="400px", width="100%")
Variable Bronx (N=44) Brooklyn (N=64) Manhattan (N=39) Queens (N=77) Overall (N=224)
total_pop
  Mean (SD) 30200 (18700) 37800 (24600) 38300 (24600) 25900 (20900) 32300 (22800)
  Median [Min, Max] 29800 [0, 69200] 38300 [0, 97800] 35700 [0, 95300] 25000 [0, 87700] 31600 [0, 97800]
mean_income
  Mean (SD) 43400 (17900) 69600 (28700) 103000 (49700) 74500 (14400) 71900 (33900)
  Median [Min, Max] 38000 [23100, 94200] 61200 [27400, 148000] 108000 [33300, 212000] 72600 [37500, 104000] 67200 [23100, 212000]
  Missing 8 (18.2%) 9 (14.1%) 6 (15.4%) 19 (24.7%) 42 (18.8%)
mean_rent
  Mean (SD) 1230 (197) 1580 (465) 1960 (674) 1650 (198) 1600 (465)
  Median [Min, Max] 1240 [833, 1620] 1450 [792, 3280] 2070 [884, 3270] 1630 [1140, 2250] 1510 [792, 3280]
  Missing 8 (18.2%) 9 (14.1%) 6 (15.4%) 19 (24.7%) 42 (18.8%)
gini_neighborhood
  Mean (SD) 0.467 (0.0317) 0.462 (0.0292) 0.516 (0.0379) 0.423 (0.0260) 0.459 (0.0439)
  Median [Min, Max] 0.467 [0.407, 0.519] 0.466 [0.382, 0.515] 0.525 [0.436, 0.582] 0.420 [0.358, 0.484] 0.457 [0.358, 0.582]
below_poverty_line_perc
  Mean (SD) 0.251 (0.122) 0.197 (0.143) 0.168 (0.129) 0.119 (0.0832) 0.179 (0.128)
  Median [Min, Max] 0.251 [0, 0.444] 0.179 [0, 1.00] 0.126 [0, 0.576] 0.103 [0, 0.615] 0.148 [0, 1.00]
  Missing 3 (6.8%) 6 (9.4%) 3 (7.7%) 15 (19.5%) 27 (12.1%)
eviction_count
  Mean (SD) 438 (340) 245 (234) 223 (233) 126 (131) 238 (255)
  Median [Min, Max] 406 [1.00, 1130] 163 [1.00, 829] 152 [1.00, 1120] 93.0 [1.00, 521] 148 [1.00, 1130]
transportation_desert_4cat
  Poor 4 (9.1%) 8 (12.5%) 2 (5.1%) 40 (51.9%) 54 (24.1%)
  Limited 12 (27.3%) 14 (21.9%) 0 (0%) 18 (23.4%) 44 (19.6%)
  Satisfactory 8 (18.2%) 12 (18.8%) 7 (17.9%) 9 (11.7%) 36 (16.1%)
  Excellent 20 (45.5%) 30 (46.9%) 30 (76.9%) 10 (13.0%) 90 (40.2%)
noncitizen_perc
  Mean (SD) 0.174 (0.0936) 0.150 (0.132) 0.139 (0.0506) 0.174 (0.101) 0.160 (0.103)
  Median [Min, Max] 0.169 [0, 0.581] 0.127 [0, 1.00] 0.133 [0, 0.242] 0.143 [0, 0.471] 0.143 [0, 1.00]
  Missing 3 (6.8%) 6 (9.4%) 3 (7.7%) 15 (19.5%) 27 (12.1%)
white_perc
  Mean (SD) 0.114 (0.163) 0.381 (0.277) 0.454 (0.268) 0.287 (0.247) 0.309 (0.270)
  Median [Min, Max] 0.0313 [0, 0.606] 0.385 [0, 1.00] 0.496 [0, 0.829] 0.247 [0, 1.00] 0.222 [0, 1.00]
  Missing 3 (6.8%) 6 (9.4%) 3 (7.7%) 15 (19.5%) 27 (12.1%)
black_perc
  Mean (SD) 0.301 (0.185) 0.292 (0.308) 0.146 (0.179) 0.178 (0.273) 0.231 (0.261)
  Median [Min, Max] 0.274 [0.0631, 0.833] 0.152 [0, 1.00] 0.0585 [0.00873, 0.591] 0.0406 [0, 0.899] 0.121 [0, 1.00]
  Missing 3 (6.8%) 6 (9.4%) 3 (7.7%) 15 (19.5%) 27 (12.1%)
latinx_perc
  Mean (SD) 0.531 (0.197) 0.190 (0.163) 0.246 (0.206) 0.255 (0.178) 0.292 (0.221)
  Median [Min, Max] 0.618 [0, 0.784] 0.143 [0, 0.808] 0.170 [0.0608, 0.724] 0.217 [0, 0.856] 0.203 [0, 0.856]
  Missing 3 (6.8%) 6 (9.4%) 3 (7.7%) 15 (19.5%) 27 (12.1%)
asian_perc
  Mean (SD) 0.0370 (0.0496) 0.108 (0.121) 0.127 (0.116) 0.241 (0.190) 0.138 (0.156)
  Median [Min, Max] 0.0179 [0, 0.217] 0.0733 [0, 0.472] 0.113 [0, 0.661] 0.205 [0, 0.757] 0.0777 [0, 0.757]
  Missing 3 (6.8%) 6 (9.4%) 3 (7.7%) 15 (19.5%) 27 (12.1%)

We found that Manhattan had the largest population counts, highest mean rental prices, highest mean income, highest income inequality (e.g. gini value), most number of neighborhoods with excellent subway access, and the greatest proportion of white citizens.

Conversely, we observed that the Bronx has the lowest mean income and highest eviction counts, while having the highest proportions of people living below the poverty line, and the highest number of people with limited subway access. Importantly, the Bronx also had the highest densities of both Latinx and Black residents of any other borough in New York, meaning that the Black and Latinx residents in New York are experiencing the burden of New York’s structural inequities.

We will touch on the connections between demographics, transportation access, and housing more explicitly in the following sections.

Subway Accessibility

New York City is the most populous city in the US with more than 8.8 million people. To support the daily commutes of its residents, NYC also built the New York City Subway, the oldest, longest, and currently busiest subway system in the US, averaging approximately 5.6 million daily rides on weekdays and a combined 5.7 million rides each weekend.

Compared to other US cities where automobiles are the most popular mode of transportation (ahem, Minneapolis), only 32% of NYC’s population chooses to commute by cars. NYC’s far-reaching transit system is then unique, given that more than 70% of the population commute by cars in other metropolitan areas.

Despite having the most extensive transit network in the entire US, NYC is still lacking in terms of transit accessibility for some neighborhoods. The general consensus in academia is that residents who walk more than 0.5 miles to get to reliable transit are considered lacking transportation access, or residing in a transportation desert. For our research, we adopted this concept to study these gaps in transportation access. Specifically, we attempt to identify and study “Subway Deserts”.

Subway Desert Definition

Extending the USDA’s definition of a food desert, we define subway deserts as the percentage of a neighborhood— or any arbitrary geographic area— that is within walking distance of any subway stop. Citing the U.S. Federal Highway Administration, we defined walking distance as a 0.5 mile radius and computed these regions in ArcGIS. We chose subway stations because of the subway’s reliable frequency, high connectivity between boroughs, and high ridership per vehicle. Our argument against including the number of bus stops in our calculations of transportation access is that the quantity of bus stops does not accurately imply public transport accessibility due to the variability in bus efficiency, punctuality, and use. A major limitation of our work was the omission of Staten Island because it is not connected to any other borough by subway. Rather, Staten Island users typically drive or train into the city. Further, we felt that the inclusion of Staten Island would mischaracterize the relationship between lacking access and not needing access since Staten Island is an overwhelmingly white, wealthy, borough that has high levels of car ownership.

We first geocoded subway stop locations in NYC from the NYC Department of Transportation. Then, using ArcGIS we created a 0.5-mile-radius buffer for each station and calculated what percent of each neighborhood was covered by a buffer region. We display an example below.

A nice image.

A nice image.

In the graph, buffer zones are in light pink with overlapping boundaries dissolved between stations, while the dark pink dots indicate the exact geographic locations of the stations. Each neighborhood, then, had a percentage score that defined it’s subway accessibility score.

Upon observation, we categorized the areas served by the subway network into four ordinal categories: Poor, Limited, Satisfactory, and Excellent. These categories are defined at 0-1%, 1-75%, 75-90%, and 90-100% of area covered by transit, respectively. We defined these cutoffs using the distribution of subway coverage percentages and our own judgment on what constitutes a desert. As such, these cutoffs are specific to New York City and may not be perfectly reproducible. The following plot details the spatial locations of these transportation categories.

nyc_compiled %>%
  ggplot() +
  geom_sf(aes(fill = transportation_desert_4cat), color = "#8f98aa") +
  scale_fill_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"),
                       guide = guide_legend(title = "Subway Accessibility \nCategory"), na.value="#D6D6D6") +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Subway Accessibility by \nNeighborhood in NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 35, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

In the following two subsections, we describe how we understood and classified transportation deserts using two models.

Naive Bayes Model

Naive Bayes Model is one of the most popular models for classifying a response variable with 2 or more categories.

We implemented a naive Bayes classifier on subway access because it is both computationally efficient and applicable to Bayesian classification settings where outcomes may have 2+ categories. Specifically, we fit transportation access by taking mean_income, percentage below the poverty line, and the number of grocery stores. Because we are predicting 4 levels of transportation access, we initially fit this model using the e1071 package to classify subway transit level.

library(e1071)

nyc_naive <- read_csv(here("clean_data", "nyc_names_vichy_complied2.csv"))%>%
  mutate(transportation_desert_4cat = factor(transportation_desert_4cat, levels=c("Poor", "Limited", "Satisfactory", "Excellent"))) %>%
  mutate(transportation_desert_4num = factor(as.numeric(transportation_desert_4cat))) %>%
   mutate(asian_perc = (asian_count / total_pop) * 100) %>%
   mutate(white_perc = (white_count / total_pop)* 100) %>%
   mutate(black_perc = (black_count / total_pop)* 100) %>%
   mutate(latinx_perc = (latinx_count / total_pop)* 100) %>%
   mutate(native_perc = (native_count / total_pop)* 100) %>%
   mutate(below_poverty_perc = (below_poverty_line_count / total_pop) * 100) %>%
   mutate(noncitizen_perc = (noncitizen_count / total_pop)* 100) %>%
   mutate(evictions_perc = (eviction_count / total_pop)* 100) %>%
   mutate(uninsured_perc = (uninsured_count / total_pop)* 100) %>%
  mutate(unemployment_perc = (unemployment_count / total_pop)* 100) %>%
  filter(nta_type == 0)

set.seed(454)
naive_model <- naiveBayes(transportation_desert_4cat ~ 
                              mean_income +
                              below_poverty_perc +
                              store_count,
                            data = nyc_naive)
naive2_prediction <- naive_classification_summary_cv(naive_model, 
                                nyc_naive, 
                                y="transportation_desert_4cat", k=10)$cv
naive2_prediction %>%
  kable(align = "c", caption = "Naive Model - Summary") %>% 
  kable_styling()
Naive Model - Summary
transportation_desert_4cat Poor Limited Satisfactory Excellent
Poor 78.12% (25) 12.50% (4) 0.00% (0) 9.38% (3)
Limited 31.43% (11) 11.43% (4) 20.00% (7) 37.14% (13)
Satisfactory 20.69% (6) 17.24% (5) 6.90% (2) 55.17% (16)
Excellent 9.52% (8) 15.48% (13) 2.38% (2) 72.62% (61)

Under 10-fold cross validation, our Naive Bayes model had an overall cross-validated accuracy of 51.11%. However, our predictions were most accurate when predicting Poor transportation access (78.12%) and Excellent transportation access (72.62%). The following plot describes the cross-validated accuracy breakdown by each observed transportation access category.

prediction <- as.data.frame(naive2_prediction) %>%
  pivot_longer(
  cols= Poor:Excellent, 
  names_to = 'Predictions', 
  values_to = 'Probability'
) %>% 
  mutate(Probability = as.numeric(str_extract(Probability,'\\d+\\.\\d+'))/100) %>%
  mutate(transportation_desert_4cat = factor(transportation_desert_4cat, levels=c("Poor", "Limited", "Satisfactory", "Excellent"))) %>%
  mutate(Predictions = factor(Predictions, levels=c("Poor", "Limited", "Satisfactory", "Excellent"))) 


prediction %>%  
  ggplot(aes(x=transportation_desert_4cat, y=Probability, fill=Predictions)) + 
  geom_bar(position="fill", stat="identity")  +
  scale_y_continuous(labels = seq(0, 100, by = 25)) +
  labs(title="Accessibility Predictions by Observed Category", y="Proportion", x="")+
    theme(panel.grid.major.x = element_line("transparent"),
         # axis.text.y.left = element_blank(),
          axis.text.x.bottom = element_text(size = 12, face = "bold"),
          plot.title = element_text(size = 20, hjust=.5, face = "bold", family="DIN Condensed")) +
   scale_fill_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"),
                       guide = guide_legend(title = "Subway Accessibility \nCategory"), na.value="#D6D6D6") 

From the plot, it is clear that our naive Bayes model is sufficient when predicting the extrema of subway (in)access given the overwhelming proportion of true-poor and true-excellent classifications. However, it remains imperfect when considering the inaccuracy for both the limited and satisfactory transportation categories, our data’s distributions, and its interpretability.

Importantly, naive Bayes assumes that all quantitative predictors are normally distributed within each Y category and further it assumes (i.e. that predictors are independent within each Y category). Our data do not meet these assumptions, unfortunately. Further, naive Bayes is a blackbox classifier. That is, naive Bayes classification might give us accurate predictions, but it doesn’t give us a sense of where these predictions come from or subway access is related to the predictors.

Ordinal Model

Having realized the shortcomings of the naive Bayes model, we wanted to see if there are any alternatives. We land on the ordinal regression model.

library(rsample)
set.seed(454)

nyc_compiled_classify <- nyc_compiled %>%
  mutate(transportation_desert_4cat = factor(transportation_desert_4cat, levels=c("Poor", "Limited", "Satisfactory", "Excellent"))) %>%
  mutate(transportation_desert_4num = factor(as.numeric(transportation_desert_4cat))) %>%
   mutate(asian_perc = (asian_count / total_pop) * 100) %>%
   mutate(white_perc = (white_count / total_pop)* 100) %>%
   mutate(black_perc = (black_count / total_pop)* 100) %>%
   mutate(latinx_perc = (latinx_count / total_pop)* 100) %>%
   mutate(native_perc = (native_count / total_pop)* 100) %>%
   mutate(below_poverty_perc = (below_poverty_line_count / total_pop) * 100) %>%
   mutate(noncitizen_perc = (noncitizen_count / total_pop)* 100) %>%
   mutate(uninsured_perc = (uninsured_count / total_pop)* 100) %>%
  mutate(unemployment_perc = (unemployment_count / total_pop)* 100) %>%
  filter(nta_type == 0) %>%
  as.tibble()

data_split <- initial_split(nyc_compiled_classify, prop = .8) 
data_train <- training(data_split)
data_test <- testing(data_split) 
model2 <- stan_polr(as.factor(transportation_desert_4num) ~ mean_income + below_poverty_line_perc + store_count, 
                    data =data_train, prior_counts = dirichlet(1),
                    prior=R2(0.5), iter=500, seed = 86437, refresh=0, prior_PD=FALSE)


tidy(model2, effects = "fixed", conf.int = TRUE, conf.level = 0.8) %>%
  mutate(term = case_when(
    term == "1|2" ~ "Poor | Limited",
    term == "2|3" ~ "Limited | Satisfactory",
    term == "3|4" ~ "Satisfactory | Excellent",
    TRUE ~ term
    ))%>%
  kable(align = "c", caption = "Ordinal Model - Summary") %>% 
  kable_styling() 
Ordinal Model - Summary
term estimate std.error conf.low conf.high
mean_income 0.0000460 0.0000104 0.0000339 0.0000596
below_poverty_line_perc 17.0648982 3.4532734 12.8319749 22.3269332
store_count 0.0189996 0.0052394 0.0126227 0.0254789
Poor | Limited 5.4113154 1.2620341 3.9385520 7.1398347
Limited | Satisfactory 6.7324319 1.3225479 5.2493699 8.5003721
Satisfactory | Excellent 7.7039327 1.3435146 6.1613497 9.5624926

Then using a function written by Connie Zhang, we describe the accuracy of the ordinal model below.

ordinal_accuracy<-function(post_preds,mydata){
  post_preds<-as.data.frame(post_preds)
  results<-c()
  for (j in (1:length(post_preds))){
    results[j]<-as.numeric(tail(names(sort(table(post_preds[,j]))))[4])
    }
  results<-as.data.frame(results)
  compare<-cbind(results,mydata$transportation_desert_4num)
  compare<-compare %>%mutate(results=as.numeric(results))
  compare<-compare %>% mutate(`mydata$transportation_desert_4num`=as.numeric(`mydata$transportation_desert_4num`))
  compare<-compare %>%mutate(accuracy=ifelse(as.numeric(results)==as.numeric(`mydata$transportation_desert_4num`),1,0))
  print(sum(compare$accuracy)/length(post_preds))
}
nyc_compiled[is.na(nyc_compiled)] = 0

set.seed(86437)

my_prediction2 <- posterior_predict(
  model2, 
  newdata = data_test)

ordinal_accuracy(my_prediction2, data_test)
## [1] 0.6388889
my_prediction2 <- posterior_predict(
  model2, 
  newdata = data_test)

prediction_long <- my_prediction2 %>% 
  t() %>% 
  as.tibble() %>%
  mutate_if(is.character, as.numeric) %>%
  rownames_to_column() %>% 
  rowwise(id=rowname) %>%
  summarize(median=ifelse(mean(c_across(where(is.numeric)))>3.5, ceiling(mean(c_across(where(is.numeric)))), floor(mean(c_across(where(is.numeric)))))) %>%
  rename(predicted_desert = median)%>%
  mutate(predicted_desert = case_when(
    predicted_desert==1 ~ "Poor",
    predicted_desert ==2 ~ "Limited",
    predicted_desert ==3 ~ "Satisfactory",
    TRUE ~ "Excellent",
  ))  %>%
  mutate(predicted_desert = factor(predicted_desert, levels=c("Poor", "Limited", "Satisfactory", "Excellent")))

data_test %>%
  dplyr::select(nta_id, borough, transportation_desert_4cat) %>%
  rownames_to_column() %>%
  left_join(., prediction_long, by="rowname") %>%
  
  ggplot(aes(x=transportation_desert_4cat, fill=predicted_desert)) + 
  geom_bar(position="fill")+
  scale_y_continuous(labels = seq(0, 100, by = 25)) +
  labs(title="Accessibility Predictions by Observed Category", y="Proportion", x="")+
    theme(panel.grid.major.x = element_line("transparent"),
         # axis.text.y.left = element_blank(),
          axis.text.x.bottom = element_text(size = 12, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size =20, hjust=.5, face = "bold")) +
   scale_fill_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"),
                       guide = guide_legend(title = "Subway Accessibility \nCategory"), na.value="#D6D6D6")

Transportation and Structural Inequity

Transportation access is a pervasive structural issue. However, previous research on transportation access has demonstrated that many of these gaps in access also deepen the chasms of racial and class-based inequity. In this analysis, it seemed that low-income and racialized— typically Black and Latinx— communities were most likely to confront transportation inequities. Additionally, we have observed that predominantly Black and Latinx neighborhoods in the Bronx faced some of the highest rates of eviction, likely indicating that these inequities may overlap.

This next section aims to connect transportation access to housing-inequities that we know also have racial and class dimensions. In particular, we wanted to assess transportation access’s relationship with immigrant community size, rental prices, and eviction counts by neighborhood.

desert_map <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot() +
  geom_sf(aes(fill = transportation_desert_4cat), color = "#8f98aa") +
  scale_fill_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"),
                       guide = guide_legend(title = "Subway Accessibility \nCategory"), na.value="#D6D6D6") +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Subway Access in \nNYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

rent_map <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot() +
  geom_sf(aes(fill = mean_rent), color = "#8f98aa") +
  scale_fill_gradient(low = "#FCF5EE", high = "#30969C",
                      guide = guide_legend(title = "Dollars")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Mean Rent in \nNYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12),
          legend.text = element_text(size = 12)) +
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

evict <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot() +
  geom_sf(aes(fill = eviction_count), color = "#8f98aa")+
  scale_fill_gradient(low = "#FCF5EE", high = "#C47572", 
                      guide = guide_legend(title = "Number of Evictions")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Eviction Counts in \nNYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

noncit <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot() +
  geom_sf(aes(fill = noncitizen_perc), color = "#8f98aa") +
scale_fill_gradient(low = "#FCF5EE", high = "#5E76AD",
                    guide = guide_legend(title = "Percent Non-Citizen")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Immigrant Density in \nNYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))
library(egg)
ggarrange(desert_map, rent_map, evict, noncit,
          ncol=2, nrow=2)

Transportation access is typically worst in areas with the highest densities of non-citizen residents, while it is the best in neighborhoods with the highest mean rental prices. Further, observe that eviction counts are highly concentrated in north and south NYC, where some of these neighborhoods have mixed-accessibility to transit.

Unfortunately, rent, transportation access, and eviction counts may also be associated with respective density of nonwhite communities.

white <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot() +
  geom_sf(aes(fill = white_perc), color = "#8f98aa") +
  scale_fill_gradientn(colors = c("#FCF5EE","#919BB6", "#5A6687", "#3D465C"),  guide = guide_legend(title = "Percent White")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("White Population \nin NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

black <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot() +
  geom_sf(aes(fill = black_perc), color = "#8f98aa") +
  scale_fill_gradientn(colors = c("#FCF5EE","#F8ABA6", "#F58581", "#DB3F37"), guide = guide_legend(title = "Percent Black")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Black Population \nin NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

asian <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot() +
  geom_sf(aes(fill = asian_perc), color = "#8f98aa") +
  scale_fill_gradientn(colors = c("#FCF5EE","#C1C5EC", "#A1A8E2", "#4451C5"),                      guide = guide_legend(title = "Percent Asian")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Asian Population \nin NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

latinx <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot() +
  geom_sf(aes(fill = latinx_perc), color = "#8f98aa")+
  scale_fill_gradientn(colors = c("#FCF5EE","#F4E2B8", "#E9C572", "#E77E2F"), 
                      guide = guide_legend(title = "Percent Latinx")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Latinx Population \nin NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))
ggarrange(desert_map, white, black, latinx, asian, ncol=3, nrow=2)

From the plots, neighborhoods with the highest densities of Black and Asian community members also have the poorest scores of subway access, while the converse holds true for neighborhoods with the highest proportion of White residents.

Further, observe that eviction counts are most common in neighborhoods with the highest densities of Black and Latinx community members. The below faceted visualizations detail the specific relationships between the proportion of Black, Latinx, Asian, and White residents in a neighborhood with eviction counts.

black_eviction <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot(aes(x=black_perc,
             color=transportation_desert_4cat, 
             y=eviction_count)) +
  geom_point(size=2)+
  geom_smooth(aes(group=transportation_desert_4cat),span = 1, size=3, se=F)+
  labs(title="Eviction Counts by Black Density \nper Transportation Category", y="", x="Black (%)")+
  theme_linedraw()+
    theme(legend.position = "none",
         # axis.text.y.left = element_blank(),
        #  axis.text.x.bottom = element_text(size = 16, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size = 30 ,hjust=.5, face = "bold"),
    strip.background.x = element_rect(fill="Black", color="Black"),
    strip.text.x = element_text(size = 14, 
                                color = "White",
                                face = "bold")) + 
   scale_color_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"))+  facet_wrap(.~transportation_desert_4cat, scales = "free") 

asian_eviction <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot(aes(x=asian_perc,
             color=transportation_desert_4cat, 
             y=eviction_count)) +
  geom_point(size=2)+
  geom_smooth(aes(group=transportation_desert_4cat),span = 1, size=3, se=F)+
  labs(title="Eviction Counts by Asian Density \nper Transportation Category", y="", x="Asian (%)")+
  theme_linedraw()+
    theme(legend.position = "none",
         # axis.text.y.left = element_blank(),
        #  axis.text.x.bottom = element_text(size = 16, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size = 30,hjust=.5, face = "bold"),
    strip.background.x = element_rect(fill="Black", color="Black"),
    strip.text.x = element_text(size = 14, 
                                color = "White",
                                face = "bold")) + 
   scale_color_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"))+  facet_wrap(.~transportation_desert_4cat, scales = "free") 

latinx_eviction <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot(aes(x=latinx_perc,
             color=transportation_desert_4cat, 
             y=eviction_count)) +
  geom_point(size=2)+
  geom_smooth(aes(group=transportation_desert_4cat),span = 1, size=3, se=F)+
  labs(title="Eviction Counts by Latinx Density \nper Transportation Category", y="", x="Latinx (%)")+
  theme_linedraw()+
    theme(legend.position = "none",
         # axis.text.y.left = element_blank(),
        #  axis.text.x.bottom = element_text(size = 16, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size = 30,hjust=.5, face = "bold"),
    strip.background.x = element_rect(fill="Black", color="Black"),
    strip.text.x = element_text(size = 14, 
                                color = "White",
                                face = "bold")) + 
   scale_color_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"))+  facet_wrap(.~transportation_desert_4cat, scales = "free") 

white_eviction <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot(aes(x=white_perc,
             color=transportation_desert_4cat, 
             y=eviction_count)) +
  geom_point(size=2)+
  geom_smooth(aes(group=transportation_desert_4cat),span = 1, size=3, se=F)+
  labs(title="Eviction Counts by White Density \nper Transportation Category", y="", x="White (%)")+
  theme_linedraw()+
    theme(legend.position = "none",
         # axis.text.y.left = element_blank(),
        #  axis.text.x.bottom = element_text(size = 16, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size = 30,hjust=.5, face = "bold"),
    strip.background.x = element_rect(fill="Black", color="Black"),
    strip.text.x = element_text(size = 14, 
                                color = "White",
                                face = "bold")) + 
   scale_color_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"))+  facet_wrap(.~transportation_desert_4cat, scales = "free") 
ggarrange(black_eviction, latinx_eviction, asian_eviction, white_eviction, ncol=2, nrow=2)

Black and Latinx residential proportions by neighborhood are associated with increased eviction counts. However, we must specify that Black resident proportions are uniformly associated with increases in eviction counts across transportation levels, while the relationship between eviction counts and Latinx resident proportion is not. In contrast, both White and Asian proportions are uniformly associated with decreases in eviction counts.

Lastly, let’s consider mean neighborhood rental prices. The following visualizations detail the relationships between the proportion of Black, Latinx, Asian, and White residents in a neighborhood with that neighborhood mean rental price.

black_rent <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot(aes(x=black_perc,
             color=transportation_desert_4cat, 
             y=mean_rent)) +
  geom_point(size=2)+
  geom_smooth(aes(group=transportation_desert_4cat),span = 1, size=3, se=F)+
  labs(title="Neighborhood Rental Price Averages \nby Black Density per Transportation Category", y="", x="Black (%)")+
  theme_linedraw()+
    theme(legend.position = "none",
         # axis.text.y.left = element_blank(),
        #  axis.text.x.bottom = element_text(size = 16, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size =25,hjust=.5, face = "bold"),
    strip.background.x = element_rect(fill="Black", color="Black"),
    strip.text.x = element_text(size = 14, 
                                color = "White",
                                face = "bold")) + 
   scale_color_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"))+  facet_wrap(.~transportation_desert_4cat, scales = "free") 

asian_rent <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot(aes(x=asian_perc,
             color=transportation_desert_4cat, 
             y=mean_rent)) +
  geom_point(size=2)+
  geom_smooth(aes(group=transportation_desert_4cat),span = 1, size=3, se=F)+
  labs(title="Neighborhood Rental Price Averages \nby Asian Density per Transportation Category", y="", x="Asian (%)")+
  theme_linedraw()+
    theme(legend.position = "none",
         # axis.text.y.left = element_blank(),
        #  axis.text.x.bottom = element_text(size = 16, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size = 25,hjust=.5, face = "bold"),
    strip.background.x = element_rect(fill="Black", color="Black"),
    strip.text.x = element_text(size = 14, 
                                color = "White",
                                face = "bold")) + 
   scale_color_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"))+  facet_wrap(.~transportation_desert_4cat, scales = "free") 

latinx_rent <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot(aes(x=latinx_perc,
             color=transportation_desert_4cat, 
             y=mean_rent)) +
  geom_point(size=2)+
  geom_smooth(aes(group=transportation_desert_4cat),span = 1, size=3, se=F)+
  labs(title="Neighborhood Rental Price Averages \nby Latinx Density per Transportation Category", y="", x="Latinx (%)")+
  theme_linedraw()+
    theme(legend.position = "none",
         # axis.text.y.left = element_blank(),
        #  axis.text.x.bottom = element_text(size = 16, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size = 25,hjust=.5, face = "bold"),
    strip.background.x = element_rect(fill="Black", color="Black"),
    strip.text.x = element_text(size = 14, 
                                color = "White",
                                face = "bold")) + 
   scale_color_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"))+  facet_wrap(.~transportation_desert_4cat, scales = "free") 

white_rent <- nyc_compiled %>%
  filter(nta_type == 0) %>%
  ggplot(aes(x=white_perc,
             color=transportation_desert_4cat, 
             y=mean_rent)) +
  geom_point(size=2)+
  geom_smooth(aes(group=transportation_desert_4cat),span = 1, size=3, se=F)+
  labs(title="Neighborhood Rental Price Averages \nby White Density per Transportation Category", y="", x="White (%)")+
  theme_linedraw()+
    theme(legend.position = "none",
         # axis.text.y.left = element_blank(),
        #  axis.text.x.bottom = element_text(size = 16, face = "bold"),
          plot.title = element_text(family="DIN Condensed", size = 25,hjust=.5, face = "bold"),
    strip.background.x = element_rect(fill="Black", color="Black"),
    strip.text.x = element_text(size = 14, 
                                color = "White",
                                face = "bold")) + 
   scale_color_manual(values=c("#895F32","#E9DBC2","#7D9B8A", "#395645"))+  facet_wrap(.~transportation_desert_4cat, scales = "free") 
ggarrange(black_rent, latinx_rent, asian_rent, white_rent, ncol=2, nrow=2)

Increases in White-resident proportions were uniformly associated with increases in average neighborhood rental prices across all transportation access categories. The converse is true for the relationship between Black and Latinx neighborhood densities and rental prices. It then seems that despite living in cheaper neighborhoods, both NYC’s Black and Latinx communities are carrying the largest burden of eviction.

Our next section details the statistical models we used to better characterize the clear relationships between housing inequities, urban racism, and transportation access.

Non-Hierarchical Models

Bayesian Model Specification

In order to understand the respective distributions of immigrant population size, evictions, and mean rental prices in NYC, we fit 3 non-hierarchical Bayesian models with each variable as an outcome. In previous iterations, we fit hierarchical models on these same relationships where neighborhood values were grouped by borough. We initially felt that the internal diversity of boroughs with respect to housing, demographics, and wealth made it more appropriate as a grouping variable since it did not cary a coherent set of effects. However, we ultimately saw that the non-hierarchical models we fit were largely comparable to their hierarchical counterparts. Additionally, we found that borough-level policies and housing plans did have meaningful effects on eviction and rental prices. As such, we have decided it should be considered as a covariate we adjust for, rather than a grouping variable.

We list our non-hierarchical models and their predictors below:

  • Non-Citizen Count Model (1)
    • Predictors: transportation_desert_4cat, total_pop, borough, gini_neighborhood, mean_income, mean_rent, unemployment_perc, black_perc, latinx_perc, asian_perc
  • Average Rental Price Model (2)
    • Predictors: transportation_desert_4cat, borough, gini_neighborhood, mean_income, black_perc, latinx_perc, asian_perc, bus_count,school_count,store_count, noncitizen_perc
  • Eviction Count Model (3)
    • Predictors: transportation_desert_4cat, total_pop, borough, gini_neighborhood, mean_income, mean_rent, black_perc, latinx_perc, asian_perc, bus_count,store_count,unemployment_perc, uninsured_perc

Across all models we specified normal priors for the $\beta_{k}$s associated with each predictor.

For 1 and 3, we used weakly informative normal priors on all predictors and the intercept, then allowed stan_glm to estimate initial priors. This decision was ultimately due to our unfamiliarity with NYC’s history of evictions and non-citizen population. We also fit other versions of models 1 and 2 that specified a Poisson distribution of eviction and non-citizen counts. However, we observed an inconsistent spread of variance and increased 0 counts in both the eviction and immigrant count data and ultimately decided that a Negative-Binomial distribution was more appropriate. As such, we have removed these Poisson-distributed models from our discussion here.

For Model 2, we also used weakly informative normal priors on all predictors. However, we specified a normal prior with $\mu = 600$ and $\sigma = 20$ for the intercept of mean rental prices. We chose these parameters specifically using previous experience as renters both in NYC and other comparable major cities.

Our model specifications are detailed in the following subsections

Model 1: Immigrant/Non-Citizen Count

modeling_data <- nyc_compiled  %>%
  mutate(black_perc = black_perc * 10) %>%
  mutate(white_perc = white_perc * 10) %>%
  mutate(latinx_perc = latinx_perc * 10) %>%
  mutate(asian_perc = asian_perc * 10) %>%
  mutate(native_perc = native_perc * 10) %>%
  mutate(unemployment_perc = unemployment_perc * 5) %>%
  mutate(uninsured_perc = uninsured_perc * 5) %>%
  mutate(mean_income = mean_income / 100) %>%
  mutate(mean_rent = mean_rent / 100) %>%
  filter(nta_type==0) %>%
  mutate(nta_type = as.factor(nta_type))
noncit_model <- stan_glm(
  noncitizen_count ~
    transportation_desert_4cat + 
    total_pop + borough + gini_neighborhood + 
    mean_income + mean_rent + unemployment_perc + 
    black_perc + latinx_perc + asian_perc,
  data = modeling_data,
  family = neg_binomial_2,
  chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)

\[\begin{split} \text{Non-Citizen Count} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim \text{NegBin}(\mu, r) \; \; \; \; \text{where} \log(\mu) = \beta_{0c} + \sum^{14}_{k=1}X_{k}\beta_k \\ \beta_{0c} &\sim N(0,2.5^2)\\ \beta_{1} &\sim N(0,6.2785^2)\\ \beta_{2} &\sim N(0,6.7918^2)\\ \beta_{3} &\sim N(0,5.0879^2)\\ \beta_{4} &\sim N(0, 0.0001^2)\\ \beta_{5} &\sim N(0,5.5216^2)\\ \beta_{6} &\sim N(0,6.5781^2)\\ \beta_{7} &\sim N(0,5.2519^2)\\ \beta_{8} &\sim N(0,56.96^2)\\ \beta_{9} &\sim N(0,0.006^2)\\ \beta_{10} &\sim N(0,0.3317^2)\\ \beta_{11} &\sim N(0,7.3986^2)\\ \beta_{12} &\sim N(0,0.9779^2)\\ \beta_{13} &\sim N(0,1.0952^2)\\ \beta_{14} &\sim N(0,1.638^2)\\ r & \sim Exp(1) \\ \end{split}\]

Model 2: Mean Neighborhood Rental Prices

rent_model <- stan_glm(
  mean_rent  ~ 
  transportation_desert_4cat + 
    bus_count + school_count + store_count +
    total_pop + borough + 
    gini_neighborhood + mean_income + black_perc + latinx_perc + asian_perc,
  data = modeling_data, 
  family = gaussian,
  prior_intercept = normal(1600 , 10),
  chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)
tidy(rent_model, conf.int = TRUE, conf.level = 0.8)
## # A tibble: 16 x 5
##    term                                 estimate std.error   conf.low  conf.high
##    <chr>                                   <dbl>     <dbl>      <dbl>      <dbl>
##  1 (Intercept)                         8.32      2.72       4.86         1.17e+1
##  2 transportation_desert_4catLimited  -0.144     0.450     -0.745        4.20e-1
##  3 transportation_desert_4catSatisfa…  0.582     0.489     -0.0518       1.21e+0
##  4 transportation_desert_4catExcelle…  0.355     0.500     -0.285        9.76e-1
##  5 bus_count                          -0.00361   0.00509   -0.0102       2.89e-3
##  6 school_count                       -0.0285    0.0281    -0.0643       6.89e-3
##  7 store_count                         0.0104    0.00606    0.00288      1.80e-2
##  8 total_pop                          -0.0000136 0.0000116 -0.0000285    1.03e-6
##  9 boroughBrooklyn                     0.0574    0.475     -0.522        6.53e-1
## 10 boroughManhattan                    0.145     0.569     -0.559        8.89e-1
## 11 boroughQueens                       0.321     0.487     -0.320        9.20e-1
## 12 gini_neighborhood                  -2.48      4.92      -8.91         3.82e+0
## 13 mean_income                         0.0120    0.000634   0.0113       1.28e-2
## 14 black_perc                         -0.102     0.0745    -0.198       -6.30e-3
## 15 latinx_perc                         0.0805    0.107     -0.0575       2.26e-1
## 16 asian_perc                          0.199     0.116      0.0514       3.49e-1

\[\begin{split} \text{Mean Rent} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim \text{Normal}(\mu, \sigma) \; \; \; \; \text{where} \mu = \beta_{0c} + \sum^{14}_{k=1}X_{k}\beta_k \\ \beta_ {0c} &\sim N(1600,20^2)\\ \beta_{1} &\sim N(0,47.315^2)\\ \beta_{2} &\sim N(0,51.1837^2)\\ \beta_{3} &\sim N(0,38.3432^2)\\ \beta_{4} &\sim N(0,0.4958^2)\\ \beta_{5} &\sim N(0,2.9378^2)\\ \beta_{6} &\sim N(0,0.4371^2)\\ \beta_{7} &\sim N(0,0.0008^2)\\ \beta_{8} &\sim N(0,41.6113^2)\\ \beta_{9} &\sim N(0,49.5728^2)\\ \beta_{10} &\sim N(0,39.5783^2)\\ \beta_{11} &\sim N(0,429.2544^2)\\ \beta_{12} &\sim N(0,0.0454^2)\\ \beta_{13} &\sim N(0,7.3695^2)\\ \beta_{14} &\sim N(0,8.2532^2)\\ \beta_{15} &\sim N(0,12.3441^2)\\ \sigma & \sim Exp(0.13) \\ \end{split}\]

We chose our prior specifications of mean rental price ($\beta_{0c}$) using Juthi’s experience renting in NYC and a group conversation about typical rental prices we would elect to pay in NYC, Los Angeles, and other major cities we have lived in or around.

Model 3: Eviction Count

eviction_model <- stan_glm(
  eviction_count  ~ 
    transportation_desert_4cat +
    total_pop + borough + 
    gini_neighborhood + mean_income +  mean_rent + 
    black_perc + latinx_perc + asian_perc,
  data = modeling_data, 
  family = neg_binomial_2,
  chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)

\[\begin{split} \text{Eviction Count} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim \text{NegBin}(\mu, r) \; \; \; \; \text{where} \log(\mu) = \beta_{0c} + \sum^{14}_{k=1}X_{k}\beta_k \\ \beta_{0c} &\sim N(0,2.5^2)\\ \beta_{1} &\sim N(0,6.2992^2)\\ \beta_{2} &\sim N(0,6.7813^2)\\ \beta_{3} &\sim N(0,4.9972^2)\\ \beta_{4} &\sim N(0,1e-04^2)\\ \beta_{5} &\sim N(0,5.4697^2)\\ \beta_{6} &\sim N(0,6.443^2)\\ \beta_{7} &\sim N(0,5.3347^2)\\ \beta_{8} &\sim N(0,56.9002^2)\\ \beta_{9} &\sim N(0,0.0074^2)\\ \beta_{10} &\sim N(0,0.5699^2)\\ \beta_{11} &\sim N(0,0.993^2)\\ \beta_{12} &\sim N(0,1.142^2)\\ \beta_{13} &\sim N(0,1.6003^2)\\ r & \sim Exp(1) \\ \end{split}\]

In the following sections, we go through each particular model’s outcome and what they tell us about the relationships between transportation and housing.

Model Performance

Model 1: Immigrant/Non-Citizen Count

pp_check(noncit_model)+ 
  xlab("Non-Citizen Count") +
  labs(title = "Negative Binomial Model of \nImmigrant/Non-Citizen Count")+
  theme(plot.title =  element_text(family="DIN Condensed", face="bold", size=25, hjust=.5)) 

non_citizen_clean <- modeling_data %>%
  na.omit()%>%
  arrange(noncitizen_count)

list <- non_citizen_clean %>%
  arrange(noncitizen_count) 


set.seed(84735)

predictions_non_citizen <-  posterior_predict(
  noncit_model, newdata = non_citizen_clean)

ppc_intervals_grouped(
  y = list$noncitizen_count, 
  group = list$borough,
  yrep = predictions_non_citizen,
  prob_outer = 0.8,
  facet_args=list(scales="free")) +
  
  scale_x_continuous(
    labels = list$nta_id,
    breaks = 1:nrow(list)) +
  
  labs(x = "Neighborhood", y = "Non-Citizen Count")+

  
  xaxis_text(angle = -70, vjust = 1, hjust = 0)+
  
  theme_linedraw() +
  
  theme(panel.grid.minor = element_line("transparent"),
        panel.grid.major.x = element_line("transparent"),
        axis.text.x = element_blank(),
        plot.title = element_text(family="DIN Condensed", size = 30, hjust=.5, face = "bold"),
        strip.background.x = element_rect(fill="Black", color="Black"),
        strip.text.x = element_text(size = 30, 
                                    color = "White",
                                    face = "bold"))

modeling_data <- modeling_data %>%
  mutate(borough = as.factor(borough))
tidy(noncit_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>% 
  
  mutate(estimate= ifelse(term == "(Intercept)", exp(estimate), (exp(estimate)-1)*100), 
         conf.low= ifelse(term == "(Intercept)", exp(conf.low), (exp(conf.low)-1)*100), 
         conf.high = ifelse(term == "(Intercept)", exp(conf.high), (exp(conf.high)-1)*100))%>%
  filter(conf.low   > 0 & conf.high > 0 | conf.low  < 0 & conf.high < 0) %>%
  kable(align = "c", caption = "Non-Citizen Model - Summary") %>% 
  kable_styling()
Non-Citizen Model - Summary
term estimate std.error conf.low conf.high
(Intercept) 651.2319316 0.6039926 291.2988379 1419.8725706
transportation_desert_4catLimited 21.7379818 0.0950589 7.9879139 38.1357707
transportation_desert_4catSatisfactory 30.4229224 0.1026368 14.2878358 48.4871807
transportation_desert_4catExcellent 29.9883418 0.1021175 13.7896293 48.0654598
total_pop 0.0025347 0.0000016 0.0023291 0.0027492
boroughBrooklyn 16.5305117 0.1031461 2.1538125 33.2974749
boroughManhattan 20.4594519 0.1284852 2.3679585 42.7420163
mean_income -0.0786004 0.0002464 -0.1104105 -0.0479554
mean_rent 6.2621791 0.0173819 3.9677153 8.6604060
black_perc 5.2123981 0.0161369 3.0066056 7.3761976
latinx_perc 13.5353198 0.0239178 10.1627212 16.9991670
asian_perc 19.7315251 0.0248588 15.8858923 23.5947406
  • Limited Subway Access: When controlling for all other predictors, a neighborhood with limited transit access is expected to have approximately 21.82% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (7.48%, 38.27%) non citizen residents, indicating that neighborhoods with limited transit access almost certainly have more non citizen residents than neighborhoods with poor access.

  • Satisfactory Subway Access: When controlling for all other predictors, a neighborhood with satisfactory transit access is expected to have approximately 30.22% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (13.99%, 49.09%) non citizen residents, indicating that neighborhoods with satisfactory transit access almost certainly have more non citizen residents than neighborhoods with poor access.

  • Excellent Subway Access: When controlling for all other predictors, a neighborhood with excellent transit access is expected to have approximately 29.84% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (13.34%, 48.51%) non citizen residents, indicating that neighborhoods with excellent transit access almost certainly have more non citizen residents than neighborhoods with poor access.

  • Brooklyn: When controlling for all other predictors, a neighborhood in Brooklyn is expected to have approximately 16.61% more non-citizens than a neighborhood in the Bronx. There’s an 80% probability that this increase could lie anywhere between (1.68%, 34.07) non citizen residents, indicating that neighborhoods in Brooklyn almost certainly have more non citizen residents than the Bronx, but the magnitude of this increase may vary.

  • Manhattan: When controlling for all other predictors, a neighborhood in Manhattan is expected to have approximately 20.58% more non-citizens than in the Bronx. There’s an 80% probability that this increase could lie anywhere between (2.268, 42.85) non citizen residents, indicating that neighborhoods in Manhattan almost certainly have more non citizen residents than the Bronx.

  • Mean Income: When controlling for all other predictors, a 100 dollar increase in mean neighborhood income is associated with approximately a 0.0793% decrease in non citizen count. However, there is a 80% chance that the decrease in non citizen count may be any value between (0.1102, 0.0476), indicating that there is almost certainly a negative relationship between mean income and non citizen count, but its magnitude may vary.

  • Mean Rent: When controlling for all other predictors, a 100 dollar increase in mean neighborhood rent is associated with approximately a 6.29% increase in non citizen counts. However, there is an 80% chance that the increase in non citizen count may be any value between (3.96%, 8.68%), indicating that there is almost certainly a positive relationship between mean rent and non citizen count, but its magnitude may vary.

  • Black Percentage: When controlling for all other predictors, a 10% increase in the Black population in a neighborhood is associated with approximately a 5.20% increase in non citizen counts. However, there is an 80% chance that the increase in non citizen count may be any value between (2.99%, 7.45%), indicating that there is almost certainly a positive relationship between Black resident percentage and non citizen count, but its magnitude may vary.

  • Latinx Percentage: When controlling for all other predictors, a 10% increase in the Latinx population in a neighborhood is associated with approximately a 13.46% increase in non citizen count. However, there is an 80% chance that the increase in non citizen count may be any value between (10.08%, 17.07%), indicating that there is almost certainly a positive relationship between Latinx resident percentage and non citizen count, but its magnitude may vary.

  • Asian Percentage: When controlling for all other predictors, a 10% increase in the Asian population in a neighborhood is associated with approximately a 19.63% increase in non citizen count. However, there is an 80% chance that the increase in non citizen count may be any value between (15.75%, 23.69%), indicating that there is almost certainly a positive relationship between Asian resident percentage and non citizen count, but its magnitude may vary.

Model 2: Mean Neighborhood Rental Prices

pp_check(rent_model)+ 
  xlab("Mean Rental Price") +
  labs(title = "Normal Model of \nMonthly Mean Rental Price")+
  theme(plot.title =  element_text(family="DIN Condensed", face="bold", size=25, hjust=.5)) 

rent_clean <- modeling_data %>%
  na.omit()%>%
  arrange(mean_rent)

list <- rent_clean %>%
  arrange(mean_rent) 


set.seed(84735)

predictions_rent <-  posterior_predict(
  rent_model, newdata = rent_clean)


ppc_intervals_grouped(
  y = list$mean_rent, 
  group = list$borough,
  yrep = predictions_rent,
  prob_outer = 0.8,
  facet_args=list(scales="free")) +
  
  scale_x_continuous(
    labels = list$nta_id,
    breaks = 1:nrow(list)) +
  
  labs(x = "Neighborhood", y = "Mean Rental Price (100$)")+

  
  xaxis_text(angle = -70, vjust = 1, hjust = 0)+
  
  theme_linedraw() +
  
  theme(panel.grid.minor = element_line("transparent"),
        panel.grid.major.x = element_line("transparent"),
        axis.text.x = element_blank(),
        plot.title = element_text(family="DIN Condensed", size = 30, hjust=.5, face = "bold"),
        strip.background.x = element_rect(fill="Black", color="Black"),
        strip.text.x = element_text(size = 30, 
                                    color = "White",
                                    face = "bold"))

tidy(rent_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>% 
  filter(conf.low   > 0 & conf.high > 0 | conf.low  < 0 & conf.high < 0) %>%
  kable(align = "c", caption = "Rent Model - Summary") %>% 
  kable_styling()
Rent Model - Summary
term estimate std.error conf.low conf.high
(Intercept) 8.3199088 2.7157180 4.8635152 11.7212059
store_count 0.0103824 0.0060634 0.0028805 0.0179518
mean_income 0.0120492 0.0006336 0.0112771 0.0128432
black_perc -0.1022434 0.0744661 -0.1984486 -0.0063018
asian_perc 0.1989197 0.1160606 0.0513907 0.3491269

For rent model:

  • Store Count: When controlling for all other predictors, as grocery store and food vendor counts in a neighborhood increase by 1, rent increases by approximately $1.049 dollars. However, there is a 80% chance that the increase associated with store count may be any value between (0.307, 1.799) dollars, indicating that there is almost certainly a positive relationship between store count and rent, but its magnitude may vary slightly.

  • Mean Income: When controlling for all other predictors, a 100 dollar increase in mean neighborhood income is associated with approximately a $1.20 increase in rent. However, there is a 80% chance that the increase associated with rental price increases may be any value between (1.129, 1.284) dollars, indicating that there is almost certainly a positive relationship between mean income and mean rental prices, but its magnitude may vary slightly.

  • Black Percentage: When controlling for all other predictors, a 10% increase in the black proportion of a neighborhood is associated with approximately a $10.20 decrease in rent. However, there is an 80% chance that the decrease associated with rent may be any value between (19.36, 0.685) dollars, indicating that there is almost certainly a negative relationship between black resident percentage and rent, but its magnitude may vary.

  • Asian Percentage: When controlling for all other predictors, a 10% increase in the Asian population in a neighborhood is associated with approximately a $20.52 increase in rent. However, there is an 80% chance that the increase associated with rent may be any value between (5.763, 34.87), indicating that there is almost certainly a positive relationship between Asian resident percentage and rent, but its magnitude may vary greatly.

Model 3: Eviction Count

pp_check(eviction_model)+ 
  xlab("Eviction Count") +
  xlim(0,2000)+
  labs(title = "Negative Binomial Model of \nEviction Count")+
  theme(plot.title =  element_text(family="DIN Condensed", face="bold", size=25, hjust=.5)) 

eviction_clean <- modeling_data %>%
  na.omit()%>%
  arrange(eviction_count)

list <- eviction_clean %>%
  as.tibble() %>%
  arrange(eviction_count) 


set.seed(454)

predictions_eviction <-  posterior_predict(
  eviction_model, newdata = eviction_clean)

ppc_intervals_grouped(
  y = list$eviction_count, 
  group = list$borough,
  yrep = predictions_eviction,
  prob_outer = 0.8,
  facet_args=list(scales="free")) +
  
  scale_x_continuous(
    labels = eviction_clean$nta_id,
    breaks = 1:nrow(eviction_clean)) +
  
  labs(x = "Neighborhood", y = "Eviction Counts")+

  
  xaxis_text(angle = -70, vjust = 1, hjust = 0)+
  
  theme_linedraw() +
  
  theme(panel.grid.minor = element_line("transparent"),
        panel.grid.major.x = element_line("transparent"),
        axis.text.x = element_blank(),
        plot.title = element_text(family="DIN Condensed", size = 30, hjust=.5, face = "bold"),
        strip.background.x = element_rect(fill="Black", color="Black"),
        strip.text.x = element_text(size = 30, 
                                    color = "White",
                                    face = "bold"))

tidy(eviction_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>% 
  
  mutate(estimate= ifelse(term == "(Intercept)", exp(estimate), (exp(estimate)-1)*100), 
         conf.low= ifelse(term == "(Intercept)", exp(conf.low), (exp(conf.low)-1)*100), 
         conf.high = ifelse(term == "(Intercept)", exp(conf.high), (exp(conf.high)-1)*100))%>%
  filter(conf.low   > 0 & conf.high > 0 | conf.low  < 0 & conf.high < 0) %>%
  kable(align = "c", caption = "Eviction Model - Summary") %>% 
  kable_styling()
Eviction Model - Summary
term estimate std.error conf.low conf.high
(Intercept) 25.3708253 0.6672333 10.7399666 61.1651284
transportation_desert_4catLimited 40.3576113 0.1135317 21.6160119 63.2168111
transportation_desert_4catSatisfactory 41.5652982 0.1172567 21.4412471 65.1689188
transportation_desert_4catExcellent 41.1274899 0.1195488 21.2557143 64.4415646
total_pop 0.0022456 0.0000019 0.0020074 0.0024987
boroughBrooklyn -39.6055763 0.1167156 -48.1492101 -30.0556835
boroughManhattan -19.1336140 0.1424367 -32.7723341 -2.6146703
boroughQueens -29.0124612 0.1159364 -39.1037343 -17.1349961
gini_neighborhood 625.9717738 1.1531206 62.9663247 3224.4994674
mean_income -0.1185831 0.0002960 -0.1560185 -0.0806676
mean_rent 4.5237946 0.0197216 1.9896219 7.1613213
black_perc 17.8444198 0.0195289 14.9659855 20.7056962
latinx_perc 6.6264818 0.0282634 2.6585487 10.7641163
asian_perc -4.1810113 0.0324296 -7.9335617 -0.2019898

After removing predictors whose 80% credible intervals included the possibility of non-effect when controlling for other covariates, we found that there were 13 remaining predictors of an arbitrary neighborhood’s eviction counts. In the following list, categorical predictors’ classes were listed together:

  • Subway Accessibility (Limited, Satisfactory, Excellent)
  • Total Population
  • Borough (Brooklyn, Manhattan, Queens)
  • Gini Index
  • Mean Income
  • Mean Rent
  • Black Percentage
  • Latinx Percentage
  • Asian Percentage

In the following section, we interpret each predictor. However, we included total population to control for eviction count’s incidence relative to neighborhood population. As such, we will not interpret its specific \(\beta\) value.

Next, we interpret each predictor:

  • Limited Subway Access: When controlling for all other predictors, a neighborhood with limited access to the subway is expected to have approximately 39.81% more eviction counts than a neighborhood with poor subway access. However, there is an 80% chance that the relationship may be any value between (21.10%, 61.52%), indicating that there is almost certainly an increase in white population counts between these two neighborhoods, but its explicit magnitude may vary.
  • Satisfactory Subway Access: When controlling for all other predictors, a neighborhood with satisfactory access to the subway is expected to have approximately 41.25% more eviction counts than a neighborhood with poor subway access. However, there is an 80% chance that the relationship may be any value between (21.89%, 64.39%), indicating that there is almost certainly an increase in white population counts between these two neighborhoods, but its explicit magnitude may vary.
  • Excellent Subway Access: When controlling for all other predictors, a neighborhood with excellent access to the subway is expected to have approximately 40.58% more eviction counts than a neighborhood with poor subway access. However, there is an 80% chance that the relationship may be any value between (20.67%, 63.63%), indicating that there is almost certainly an increase in white population counts between these two neighborhoods, but its explicit magnitude may vary.
  • Gini Index: When controlling for all other predictors, 1 increases in income inequality are associated with approximately 658.52% increases in the number of evictions. However, there is a 80% chance that the relationship may be any value between (73.738%, 3243.93%), indicating that there is almost certainly a strong, positive, relationship between these two variables, but its magnitude may vary.
  • Mean Income: When controlling for all other predictors, 100 dollar increases in mean neighborhood income are associated with approximately 0.1182% decreases in the number of evictions. However, there is a 80% chance that the relationship may be any value between (-0.0799%, -0.1553%), indicating that there is almost certainly a weak, negative, relationship between these two variables, but its magnitude may vary slightly.
  • Mean Rent: When controlling for all other predictors, 100 dollar increases in mean neighborhood rental prices are associated with approximately 4.55% increases in the number of evictions. However, there is a 80% chance that the relationship may be any value between (1.94%, 7.13%), indicating that there is almost certainly a positive relationship between these two variables, but its magnitude may vary slightly.
  • Black Percent: When controlling for all other predictors, 10% increases in the proportion of Black residents in a neighborhood are associated with approximately 17.77% increases in the number of evictions. Further, there is a 80% chance that the relationship may be any value between (15.03%, 20.67%), indicating that there is almost certainly a strong, positive, relationship between these two variables, but its magnitude may vary slightly.
  • Latinx Percent: When controlling for all other predictors, 10% increases in the proportion of Latinx residents in a neighborhood are associated with approximately 6.60% increases in the number of evictions. Further, there is a 80% chance that the relationship may be any value between (2.73%, 10.69%), indicating that there is almost certainly a weak, positive, relationship between these two variables, but its magnitude may vary.
  • Asian Percent: When controlling for all other predictors, 10% increases in the proportion of Asian residents in a neighborhood are associated with approximately 4.21% decreases in the number of evictions. Further, there is a 80% chance that the relationship may be any value between (-.2344%, -7.88%), indicating that there is almost certainly a weak, negative, relationship between these two variables, but its magnitude may vary slightly.

Hierarchical Models

Bayesian Model Specification

We also fit 3 Bayesian hierarchical models similar to those in the previous section. Now, however, we let borough be a natural grouping variable of our data. Note that prior distributions for predictors’ associated \(\beta\) values and the data are the same.

We list our hierarchical models and their predictors below:

  • Non-Citizen Count Model (1)
    • Predictors: transportation_desert_4cat, total_pop, gini_neighborhood, mean_income, mean_rent, unemployment_perc, black_perc, latinx_perc, asian_perc
    • Grouping: borough
  • Average Rental Price Model (2)
    • Predictors: transportation_desert_4cat, gini_neighborhood, mean_income, black_perc, latinx_perc, asian_perc, bus_count,school_count,store_count, noncitizen_perc
    • Grouping: borough
  • Eviction Count Model (3)
    • Predictors: transportation_desert_4cat, total_pop, gini_neighborhood, mean_income, mean_rent, black_perc, latinx_perc, asian_perc, bus_count,store_count,unemployment_perc, uninsured_perc
    • Grouping: borough

Again for 1 and 3, we used weakly informative priors and allowed stan_glm to estimate initial priors. This decision was ultimately due to our unfamiliarity with NYC’s history of evictions and non-citizen population. For Model 2, we specified a normal prior with mean 1600 and standard deviation at 20 using previous experience as renters both in NYC and other comparable major cities. Across all count models, we fit both Poisson and Negative-Binomial likelihood distributions for the observed eviction and non-citizen counts. However, we observed an inconsistent spread of variance and increased 0 counts in both the eviction and immigrant count data. As such, we felt that a Negative-Binomial distribution was more appropriate for both data.

Our model specifications are detailed in the following subsections

Model 1: Immigrant/Non-Citizen Count

hi_noncit_model <- stan_glmer(
  noncitizen_count ~
    transportation_desert_4cat + 
    total_pop + gini_neighborhood + 
    mean_income + mean_rent + unemployment_perc + 
    black_perc + latinx_perc + asian_perc + (1|borough),
  data = modeling_data,
  family = neg_binomial_2,
  chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)

\[\begin{split} \text{Non-Citizen Count} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim \text{NegBin}(\mu, r) \; \; \; \; \text{where} \log(\mu) = \beta_{0c} + \sum^{14}_{k=1}X_{k}\beta_k \\ \beta_{0c} &\sim N(0,2.5^2)\\ \beta_{1} &\sim N(0,6.2785^2)\\ \beta_{2} &\sim N(0,6.7918^2)\\ \beta_{3} &\sim N(0,5.0879^2)\\ \beta_{4} &\sim N(0, 0.0001^2)\\ \beta_{5} &\sim N(0,5.5216^2)\\ \beta_{6} &\sim N(0,6.5781^2)\\ \beta_{7} &\sim N(0,5.2519^2)\\ \beta_{8} &\sim N(0,56.96^2)\\ \beta_{9} &\sim N(0,0.006^2)\\ \beta_{10} &\sim N(0,0.3317^2)\\ \beta_{11} &\sim N(0,7.3986^2)\\ \beta_{12} &\sim N(0,0.9779^2)\\ \beta_{13} &\sim N(0,1.0952^2)\\ \beta_{14} &\sim N(0,1.638^2)\\ r & \sim Exp(1) \\ \end{split}\]

Model 2: Mean Neighborhood Rental Prices

hi_rent_model <- stan_glmer(
  mean_rent  ~ 
  transportation_desert_4cat + 
    transportation_desert_4cat + 
    bus_count + school_count + store_count +
    total_pop + borough + 
    gini_neighborhood + mean_income + black_perc + latinx_perc + asian_perc+ (1 | borough),
  data = modeling_data, 
  family = gaussian,
  prior_intercept = normal(1600 , 20),
  # prior = hs_prior,
  chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)

\[\begin{split} \text{Mean Rent} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim \text{Normal}(\mu, \sigma) \; \; \; \; \text{where} \mu = \beta_{0c} + \sum^{14}_{k=1}X_{k}\beta_k \\ \beta_ {0c} &\sim N(1600,20^2)\\ \beta_{1} &\sim N(0,47.315^2)\\ \beta_{2} &\sim N(0,51.1837^2)\\ \beta_{3} &\sim N(0,38.3432^2)\\ \beta_{4} &\sim N(0,0.4958^2)\\ \beta_{5} &\sim N(0,2.9378^2)\\ \beta_{6} &\sim N(0,0.4371^2)\\ \beta_{7} &\sim N(0,0.0008^2)\\ \beta_{8} &\sim N(0,41.6113^2)\\ \beta_{9} &\sim N(0,49.5728^2)\\ \beta_{10} &\sim N(0,39.5783^2)\\ \beta_{11} &\sim N(0,429.2544^2)\\ \beta_{12} &\sim N(0,0.0454^2)\\ \beta_{13} &\sim N(0,7.3695^2)\\ \beta_{14} &\sim N(0,8.2532^2)\\ \beta_{15} &\sim N(0,12.3441^2)\\ \sigma & \sim Exp(0.13) \\ \end{split}\]

We chose our prior specifications of mean rental price ($\beta_{0c}$) using Juthi’s experience renting in NYC and a group conversation about typical rental prices we would elect to pay in NYC, Los Angeles, and other major cities we have lived in or around.

Model 3: Eviction Count

hi_eviction_model <- stan_glmer(
  eviction_count  ~ 
    transportation_desert_4cat +
    total_pop + 
    gini_neighborhood + mean_income +  mean_rent + 
    black_perc + latinx_perc + asian_perc + (1|borough),
  data = modeling_data, 
  family = neg_binomial_2,
  chains = 4, iter = 1000*2, seed = 84735, refresh = 0
)

\[\begin{split} \text{Eviction Count} \mid \beta_{0c}, \beta_1, ..., \beta_k, r & \sim \text{NegBin}(\mu, r) \; \; \; \; \text{where} \log(\mu) = \beta_{0c} + \sum^{14}_{k=1}X_{k}\beta_k \\ \beta_{0c} &\sim N(0,2.5^2)\\ \beta_{1} &\sim N(0,6.2992^2)\\ \beta_{2} &\sim N(0,6.7813^2)\\ \beta_{3} &\sim N(0,4.9972^2)\\ \beta_{4} &\sim N(0,1e-04^2)\\ \beta_{5} &\sim N(0,5.4697^2)\\ \beta_{6} &\sim N(0,6.443^2)\\ \beta_{7} &\sim N(0,5.3347^2)\\ \beta_{8} &\sim N(0,56.9002^2)\\ \beta_{9} &\sim N(0,0.0074^2)\\ \beta_{10} &\sim N(0,0.5699^2)\\ \beta_{11} &\sim N(0,0.993^2)\\ \beta_{12} &\sim N(0,1.142^2)\\ \beta_{13} &\sim N(0,1.6003^2)\\ r & \sim Exp(1) \\ \end{split}\]

In the following sections, we go through each particular model’s outcome and what they tell us about the relationships between transportation and housing.

Model Performance

Model 1: Immigrant/Non-Citizen Count

pp_check(hi_noncit_model)+ 
  xlab("Non-Citizen Count") +
  labs(title = "Hierarchical Negative Binomial Model of \nImmigrant/Non-Citizen Count")+
  theme(plot.title =  element_text(family="DIN Condensed", face="bold", size=25, hjust=.5)) 

non_citizen_clean <- modeling_data %>%
  na.omit()%>%
  arrange(noncitizen_count)

list <- non_citizen_clean %>%
  arrange(noncitizen_count) 


set.seed(84735)

predictions_non_citizen <-  posterior_predict(
  hi_noncit_model, newdata = non_citizen_clean)

ppc_intervals_grouped(
  y = list$noncitizen_count, 
  group = list$borough,
  yrep = predictions_non_citizen,
  prob_outer = 0.8,
  facet_args=list(scales="free")) +
  
  scale_x_continuous(
    labels = list$nta_id,
    breaks = 1:nrow(list)) +
  
  labs(x = "Neighborhood", y = "Non-Citizen Count")+

  
  xaxis_text(angle = -70, vjust = 1, hjust = 0)+
  
  theme_linedraw() +
  
  theme(panel.grid.minor = element_line("transparent"),
        panel.grid.major.x = element_line("transparent"),
        axis.text.x = element_blank(),
        plot.title = element_text(family="DIN Condensed", size = 30, hjust=.5, face = "bold"),
        strip.background.x = element_rect(fill="Black", color="Black"),
        strip.text.x = element_text(size = 30, 
                                    color = "White",
                                    face = "bold"))

tidy(hi_noncit_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>% 
  
  mutate(estimate= ifelse(term == "(Intercept)", exp(estimate), (exp(estimate)-1)*100), 
         conf.low= ifelse(term == "(Intercept)", exp(conf.low), (exp(conf.low)-1)*100), 
         conf.high = ifelse(term == "(Intercept)", exp(conf.high), (exp(conf.high)-1)*100))%>%
  filter(conf.low   > 0 & conf.high > 0 | conf.low  < 0 & conf.high < 0) %>%
  kable(align = "c", caption = "Hierarchical Non-Citizen Model - Summary") %>% 
  kable_styling()
Hierarchical Non-Citizen Model - Summary
term estimate std.error conf.low conf.high
(Intercept) 668.7555749 0.4822638 352.7867633 1239.9553891
transportation_desert_4catLimited 21.2348654 0.0927826 7.3734445 36.5618823
transportation_desert_4catSatisfactory 30.5148214 0.1039224 14.3928645 48.6004626
transportation_desert_4catExcellent 31.3940307 0.1001600 15.1289340 49.1993498
total_pop 0.0025713 0.0000016 0.0023639 0.0027890
mean_income -0.0790660 0.0002340 -0.1092396 -0.0477299
mean_rent 6.4009228 0.0171000 4.0019748 8.7813769
black_perc 5.2586629 0.0166608 3.1614027 7.4417223
latinx_perc 12.7012134 0.0232048 9.3858206 15.9996110
asian_perc 19.7701025 0.0242585 16.0228215 23.6661426
  • Limited Subway Access: When controlling for all other predictors, a neighborhood with limited transit access is expected to have approximately 21.82% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (7.48%, 38.27%) non citizen residents, indicating that neighborhoods with limited transit access almost certainly have more non citizen residents than neighborhoods with poor access.

  • Satisfactory Subway Access: When controlling for all other predictors, a neighborhood with satisfactory transit access is expected to have approximately 30.22% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (13.99%, 49.09%) non citizen residents, indicating that neighborhoods with satisfactory transit access almost certainly have more non citizen residents than neighborhoods with poor access.

  • Excellent Subway Access: When controlling for all other predictors, a neighborhood with excellent transit access is expected to have approximately 29.84% more non-citizens than a neighborhood with poor transit access. There’s an 80% probability that this increase could lie anywhere between (13.34%, 48.51%) non citizen residents, indicating that neighborhoods with excellent transit access almost certainly have more non citizen residents than neighborhoods with poor access.

  • Brooklyn: When controlling for all other predictors, a neighborhood in Brooklyn is expected to have approximately 16.61% more non-citizens than a neighborhood in the Bronx. There’s an 80% probability that this increase could lie anywhere between (1.68%, 34.07) non citizen residents, indicating that neighborhoods in Brooklyn almost certainly have more non citizen residents than the Bronx, but the magnitude of this increase may vary.

  • Manhattan: When controlling for all other predictors, a neighborhood in Manhattan is expected to have approximately 20.58% more non-citizens than in the Bronx. There’s an 80% probability that this increase could lie anywhere between (2.268, 42.85) non citizen residents, indicating that neighborhoods in Manhattan almost certainly have more non citizen residents than the Bronx.

  • Mean Income: When controlling for all other predictors, a 100 dollar increase in mean neighborhood income is associated with approximately a 0.0793% decrease in non citizen count. However, there is a 80% chance that the decrease in non citizen count may be any value between (0.1102, 0.0476), indicating that there is almost certainly a negative relationship between mean income and non citizen count, but its magnitude may vary.

  • Mean Rent: When controlling for all other predictors, a 100 dollar increase in mean neighborhood rent is associated with approximately a 6.29% increase in non citizen counts. However, there is an 80% chance that the increase in non citizen count may be any value between (3.96%, 8.68%), indicating that there is almost certainly a positive relationship between mean rent and non citizen count, but its magnitude may vary.

  • Black Percentage: When controlling for all other predictors, a 10% increase in the Black population in a neighborhood is associated with approximately a 5.20% increase in non citizen counts. However, there is an 80% chance that the increase in non citizen count may be any value between (2.99%, 7.45%), indicating that there is almost certainly a positive relationship between Black resident percentage and non citizen count, but its magnitude may vary.

  • Latinx Percentage: When controlling for all other predictors, a 10% increase in the Latinx population in a neighborhood is associated with approximately a 13.46% increase in non citizen count. However, there is an 80% chance that the increase in non citizen count may be any value between (10.08%, 17.07%), indicating that there is almost certainly a positive relationship between Latinx resident percentage and non citizen count, but its magnitude may vary.

  • Asian Percentage: When controlling for all other predictors, a 10% increase in the Asian population in a neighborhood is associated with approximately a 19.63% increase in non citizen count. However, there is an 80% chance that the increase in non citizen count may be any value between (15.75%, 23.69%), indicating that there is almost certainly a positive relationship between Asian resident percentage and non citizen count, but its magnitude may vary.

Model 2: Mean Neighborhood Rental Prices

pp_check(hi_rent_model)+ 
  xlab("Mean Rental Price") +
  labs(title = "Hierarchical Normal Model of \nMonthly Mean Rental Price")+
  theme(plot.title =  element_text(family="DIN Condensed", face="bold", size=25, hjust=.5)) 

rent_clean <- modeling_data %>%
  na.omit()%>%
  arrange(mean_rent)

list <- rent_clean %>%
  arrange(mean_rent) 


set.seed(84735)

predictions_rent <-  posterior_predict(
  hi_rent_model, newdata = rent_clean)


ppc_intervals_grouped(
  y = list$mean_rent, 
  group = list$borough,
  yrep = predictions_rent,
  prob_outer = 0.8,
  facet_args=list(scales="free")) +
  
  scale_x_continuous(
    labels = list$nta_id,
    breaks = 1:nrow(list)) +
  
  labs(x = "Neighborhood", y = "Mean Rental Price (100$)")+

  
  xaxis_text(angle = -70, vjust = 1, hjust = 0)+
  
  theme_linedraw() +
  
  theme(panel.grid.minor = element_line("transparent"),
        panel.grid.major.x = element_line("transparent"),
        axis.text.x = element_blank(),
        plot.title = element_text(family="DIN Condensed", size = 30, hjust=.5, face = "bold"),
        strip.background.x = element_rect(fill="Black", color="Black"),
        strip.text.x = element_text(size = 30, 
                                    color = "White",
                                    face = "bold"))

tidy(hi_rent_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>% 
  filter(conf.low   > 0 & conf.high > 0 | conf.low  < 0 & conf.high < 0) %>%
  kable(align = "c", caption = "Hierarchical Rent Model - Summary") %>% 
  kable_styling()
Hierarchical Rent Model - Summary
term estimate std.error conf.low conf.high
(Intercept) 1562.316626 23.0792251 1532.4213790 1591.6529681
mean_income 0.012082 0.0010604 0.0107031 0.0133766

For rent model:

  • Store Count: When controlling for all other predictors, as grocery store and food vendor counts in a neighborhood increase by 1, rent increases by approximately $1.049 dollars. However, there is a 80% chance that the increase associated with store count may be any value between (0.307, 1.799) dollars, indicating that there is almost certainly a positive relationship between store count and rent, but its magnitude may vary slightly.

  • Mean Income: When controlling for all other predictors, a 100 dollar increase in mean neighborhood income is associated with approximately a $1.20 increase in rent. However, there is a 80% chance that the increase associated with rental price increases may be any value between (1.129, 1.284) dollars, indicating that there is almost certainly a positive relationship between mean income and mean rental prices, but its magnitude may vary slightly.

  • Black Percentage: When controlling for all other predictors, a 10% increase in the black proportion of a neighborhood is associated with approximately a $10.20 decrease in rent. However, there is an 80% chance that the decrease associated with rent may be any value between (19.36, 0.685) dollars, indicating that there is almost certainly a negative relationship between black resident percentage and rent, but its magnitude may vary.

  • Asian Percentage: When controlling for all other predictors, a 10% increase in the Asian population in a neighborhood is associated with approximately a $20.52 increase in rent. However, there is an 80% chance that the increase associated with rent may be any value between (5.763, 34.87), indicating that there is almost certainly a positive relationship between Asian resident percentage and rent, but its magnitude may vary greatly.

Model 3: Eviction Count

pp_check(hi_eviction_model)+ 
  xlab("Eviction Count") +
  xlim(0,2000)+
  labs(title = "Hierarchical Negative Binomial Model of \nEviction Count")+
  theme(plot.title =  element_text(family="DIN Condensed", face="bold", size=25, hjust=.5)) 

eviction_clean <- modeling_data %>%
  na.omit()%>%
  arrange(eviction_count)

list <- eviction_clean %>%
  as.tibble() %>%
  arrange(eviction_count) 


set.seed(454)

predictions_eviction <-  posterior_predict(
  hi_eviction_model, newdata = eviction_clean)

ppc_intervals_grouped(
  y = list$eviction_count, 
  group = list$borough,
  yrep = predictions_eviction,
  prob_outer = 0.8,
  facet_args=list(scales="free")) +
  
  scale_x_continuous(
    labels = eviction_clean$nta_id,
    breaks = 1:nrow(eviction_clean)) +
  
  labs(x = "Neighborhood", y = "Eviction Counts")+

  
  xaxis_text(angle = -70, vjust = 1, hjust = 0)+
  
  theme_linedraw() +
  
  theme(panel.grid.minor = element_line("transparent"),
        panel.grid.major.x = element_line("transparent"),
        axis.text.x = element_blank(),
        plot.title = element_text(family="DIN Condensed", size = 30, hjust=.5, face = "bold"),
        strip.background.x = element_rect(fill="Black", color="Black"),
        strip.text.x = element_text(size = 30, 
                                    color = "White",
                                    face = "bold"))

tidy(hi_eviction_model, effects = "fixed", conf.int = TRUE, conf.level = 0.8)%>% 
  
  mutate(estimate= ifelse(term == "(Intercept)", exp(estimate), (exp(estimate)-1)*100), 
         conf.low= ifelse(term == "(Intercept)", exp(conf.low), (exp(conf.low)-1)*100), 
         conf.high = ifelse(term == "(Intercept)", exp(conf.high), (exp(conf.high)-1)*100))%>%
  filter(conf.low   > 0 & conf.high > 0 | conf.low  < 0 & conf.high < 0) %>%
  kable(align = "c", caption = "Eviction Model - Summary") %>% 
  kable_styling()
Eviction Model - Summary
term estimate std.error conf.low conf.high
(Intercept) 15.8064568 0.6569704 6.7997728 39.0783257
transportation_desert_4catLimited 40.8716224 0.1132674 20.8924906 62.6549732
transportation_desert_4catSatisfactory 42.0549771 0.1213756 22.1516181 65.9214855
transportation_desert_4catExcellent 40.3339633 0.1208082 20.5187771 63.6034304
total_pop 0.0022269 0.0000019 0.0019830 0.0024683
gini_neighborhood 827.4356794 1.1563761 107.0982856 3722.7173470
mean_income -0.1153845 0.0002841 -0.1528786 -0.0781144
mean_rent 4.5177327 0.0193389 1.9549905 7.1889348
black_perc 18.0344197 0.0189980 15.2142994 20.9750189
latinx_perc 7.6893669 0.0298007 3.5720720 11.7586903
asian_perc -4.2338560 0.0301558 -7.7706435 -0.2685338

After removing predictors whose 80% credible intervals included the possibility of non-effect when controlling for other covariates, we found that there were 13 remaining predictors of an arbitrary neighborhood’s eviction counts. In the following list, categorical predictors’ classes were listed together:

  • Subway Accessibility (Limited, Satisfactory, Excellent)
  • Total Population
  • Borough (Brooklyn, Manhattan, Queens)
  • Gini Index
  • Mean Income
  • Mean Rent
  • Black Percentage
  • Latinx Percentage
  • Asian Percentage

In the following section, we interpret each predictor. However, we included total population to control for eviction count’s incidence relative to neighborhood population. As such, we will not interpret its specific \(\beta\) value.

Next, we interpret each predictor:

  • Limited Subway Access: When controlling for all other predictors, a neighborhood with limited access to the subway is expected to have approximately 39.81% more eviction counts than a neighborhood with poor subway access. However, there is an 80% chance that the relationship may be any value between (21.10%, 61.52%), indicating that there is almost certainly an increase in white population counts between these two neighborhoods, but its explicit magnitude may vary.
  • Satisfactory Subway Access: When controlling for all other predictors, a neighborhood with satisfactory access to the subway is expected to have approximately 41.25% more eviction counts than a neighborhood with poor subway access. However, there is an 80% chance that the relationship may be any value between (21.89%, 64.39%), indicating that there is almost certainly an increase in white population counts between these two neighborhoods, but its explicit magnitude may vary.
  • Excellent Subway Access: When controlling for all other predictors, a neighborhood with excellent access to the subway is expected to have approximately 40.58% more eviction counts than a neighborhood with poor subway access. However, there is an 80% chance that the relationship may be any value between (20.67%, 63.63%), indicating that there is almost certainly an increase in white population counts between these two neighborhoods, but its explicit magnitude may vary.
  • Gini Index: When controlling for all other predictors, 1 increases in income inequality are associated with approximately 658.52% increases in the number of evictions. However, there is a 80% chance that the relationship may be any value between (73.738%, 3243.93%), indicating that there is almost certainly a strong, positive, relationship between these two variables, but its magnitude may vary.
  • Mean Income: When controlling for all other predictors, 100 dollar increases in mean neighborhood income are associated with approximately 0.1182% decreases in the number of evictions. However, there is a 80% chance that the relationship may be any value between (-0.0799%, -0.1553%), indicating that there is almost certainly a weak, negative, relationship between these two variables, but its magnitude may vary slightly.
  • Mean Rent: When controlling for all other predictors, 100 dollar increases in mean neighborhood rental prices are associated with approximately 4.55% increases in the number of evictions. However, there is a 80% chance that the relationship may be any value between (1.94%, 7.13%), indicating that there is almost certainly a positive relationship between these two variables, but its magnitude may vary slightly.
  • Black Percent: When controlling for all other predictors, 10% increases in the proportion of Black residents in a neighborhood are associated with approximately 17.77% increases in the number of evictions. Further, there is a 80% chance that the relationship may be any value between (15.03%, 20.67%), indicating that there is almost certainly a strong, positive, relationship between these two variables, but its magnitude may vary slightly.
  • Latinx Percent: When controlling for all other predictors, 10% increases in the proportion of Latinx residents in a neighborhood are associated with approximately 6.60% increases in the number of evictions. Further, there is a 80% chance that the relationship may be any value between (2.73%, 10.69%), indicating that there is almost certainly a weak, positive, relationship between these two variables, but its magnitude may vary.
  • Asian Percent: When controlling for all other predictors, 10% increases in the proportion of Asian residents in a neighborhood are associated with approximately 4.21% decreases in the number of evictions. Further, there is a 80% chance that the relationship may be any value between (-.2344%, -7.88%), indicating that there is almost certainly a weak, negative, relationship between these two variables, but its magnitude may vary slightly.

Model Comparisons

Immigrant/Non-Citizen Count

table1 <- prediction_summary(model=noncit_model, data=modeling_data) %>% 
  mutate(Model = "Non-Hierarchical")
table2 <- prediction_summary(model=hi_noncit_model, data=modeling_data) %>% 
  mutate(Model = "Hierarchical")

rbind(table1, table2) %>%
  dplyr::mutate(model = Model, .before=1) %>%
  dplyr::select(-Model)%>% kable() %>% kable_styling()
model mae mae_scaled within_50 within_95
Non-Hierarchical 1075.502 0.5396179 0.5722222 0.9722222
Hierarchical 1097.958 0.5464003 0.5666667 0.9666667

Using in-sample scaled MAE, it’s evident that the differences between our two negative binomial models of non-citizen counts were negligible. In other settings, we would typically recommend to use cross-validated error metrics to truly compare these models. However, because we are comparing a hierarchical model to a non-hierarchical model, cross-validation works slightly differently. In the former case, cross-validation via the prediction_summary_cv function in bayesrules makes data permutations to represent a “new borough”, rather than a set of arbitrary models. To circumvent these differences, we use the expected-log predictive density (ELPD) of each respective model to compare the performance of each model. The details of this approach can be found here.

mod1_elpd <- loo(noncit_model)
hi_mod1_elpd <- loo(hi_noncit_model)

# Compare the ELPD for our 2 models
loo_compare(mod1_elpd, hi_mod1_elpd)  %>%
  kable(caption = "Non-Citizen Count EPLD Metrics", align="c") %>%
  kable_styling()
Non-Citizen Count EPLD Metrics
elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
hi_noncit_model 0.000000 0.0000000 -1607.588 11.53057 13.75086 2.419586 3215.176 23.06113
noncit_model -0.703769 0.7105637 -1608.292 11.51072 14.81440 2.354994 3216.584 23.02144

From the above ELPD rankings, it seems that the non-hierarchical model of non-citizen counts performed better than its hierarchical parallel. So, we will use the non-hierarchical model of non-citizen counts.

Mean Neighborhood Rental Prices

table1 <- prediction_summary(model=rent_model, data=modeling_data) %>% 
  mutate(Model = "Non-Hierarchical")
table2 <- prediction_summary(model=hi_rent_model, data=modeling_data) %>% 
  mutate(Model = "Hierarchical")

rbind(table1, table2) %>%
  dplyr::mutate(model = Model, .before=1) %>%
  dplyr::select(-Model)%>% kable() %>% kable_styling()
model mae mae_scaled within_50 within_95
Non-Hierarchical 0.8393263 0.5202047 0.6111111 0.9555556
Hierarchical 0.8591985 0.3125708 0.8611111 0.9944444

Once again when using in-sample scaled MAE, the differences between our two normal-normal models of mean rental prices were largely minimal. Our predictions of rental price in both models were about 80 dollars away from their observed rental prices. However, the scaled mean absolute error metrics in the hierarchical model indicate that across simulations the observed value is about 0.3 standard deviations away from their the medians of the posterior predictive distributions of rental prices. This is a nontrivial improvement from the .5 standard deviation in the case of the non-hierarchical distribution. Once again, we use ELPD metrics to formally decided between the two models.

mod1_elpd <- loo(rent_model)
hi_mod1_elpd <- loo(hi_rent_model)

# Compare the ELPD for our 2 models
loo_compare(mod1_elpd, hi_mod1_elpd)  %>%
  kable(caption = "Mean Rental Price EPLD Metrics", align="c") %>%
  kable_styling()
Mean Rental Price EPLD Metrics
elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
rent_model 0.00000 0.0000 -343.9309 14.763163 20.327345 4.240698 687.8618 29.526325
hi_rent_model -36.24023 10.0377 -380.1711 4.785256 7.418874 1.186835 760.3422 9.570511

From the above ELPD rankings, it seems that the hierarchical model of mean neighborhood rental prices performed better than its non-hierarchical parallel.

rent_resid <- modeling_data %>%
  mutate(resid = rent_model$residuals) %>%
  ggplot() +
  geom_sf(aes(fill = resid), color = "#8f98aa")+
  scale_fill_gradient(low = "#FCF5EE", high = "#30969C",
                      guide = guide_legend(title = "Residual")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Non-Hierarchical Model:\nMean Rental Price Residuals \nin NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 28, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

hi_rent_resid <- modeling_data %>%
  mutate(resid = hi_rent_model$residuals) %>%
  ggplot() +
  geom_sf(aes(fill = resid), color = "#8f98aa")+
  scale_fill_gradient(low = "#FCF5EE", high = "#30969C",
                      guide = guide_legend(title = "Residual")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Hierarchical Model:\nMean Rental Price Residuals \nin NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 28, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))
ggarrange(rent_resid, hi_rent_resid, ncol=2, nrow=1)

In the case of the non-hierarchical model, the distribution of our residuals indicate that our model’s mispredictions are randomly distributed. Further, our residuals in the non-hiearchical model are relatively small when compared to the hierarcichal model. Tthe hierarcichal model must be addressed. In the hiearchical model grouped by borough there is a clear indication of systematic mispredictions that appear distributed by borough. Crucially, it seems that we are systematically overpredicting rental prices in both Queens (bottom right) and Brooklyn (bottom left), but with greater overprediction occurring for Brooklyn. The hierarchical model’s residual patterning, ELPD metric, and in-sample residual metrics all indicate that the non-hierarchical version is much more appropriate.

Eviction Count

table1 <- prediction_summary(model=eviction_model, data=modeling_data) %>% 
  mutate(Model = "Non-Hierarchical")
table2 <- prediction_summary(model=eviction_model, data=modeling_data) %>% 
  mutate(Model = "Hierarchical")

rbind(table1, table2) %>%
  dplyr::mutate(model = Model, .before=1) %>%
  dplyr::select(-Model)%>% kable() %>% kable_styling()
model mae mae_scaled within_50 within_95
Non-Hierarchical 40.71300 0.5858540 0.5944444 0.9611111
Hierarchical 40.84712 0.5621431 0.6000000 0.9611111

Once again when using in-sample scaled MAE, the differences between our two negative-binomial models of evictions count were largely minimal. Our predictions of eviction counts in both models were about 40 cases away from their observed counts and with very similar scaled MAE metrics. Typically, we’d suggest the use of the non-hierarchical model given that the performance is so similar and the computational burden of a hierarchical model is no longer warranted.

mod1_elpd <- loo(eviction_model)
hi_mod1_elpd <- loo(hi_eviction_model)

# Compare the ELPD for our 2 models
loo_compare(mod1_elpd, hi_mod1_elpd)  %>%
  kable(caption = "Eviction Count EPLD Metrics", align="c") %>%
  kable_styling()
Eviction Count EPLD Metrics
elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
hi_eviction_model 0.0000000 0.0000000 -1053.277 13.37170 14.53998 2.147071 2106.554 26.74339
eviction_model -0.0691807 0.4712791 -1053.346 13.40578 14.97743 2.180946 2106.692 26.81155

From the above ELPD rankings, it also seems that the hierarchical model of eviction counts performed better than its non-hierarchical parallel. Lastly, we will compare how these 2 models’ residuals are spatially distributed.

evict_resid <- modeling_data %>%
  mutate(resid = eviction_model$residuals) %>%
  ggplot() +
  geom_sf(aes(fill = resid), color = "#8f98aa")+
  scale_fill_gradient(low = "#FCF5EE", high = "#C47572", 
                      guide = guide_legend(title = "Residual")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Non-Hierarchical Model:\nEviction Count Residuals \nin NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 28, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

hi_evict_resid <- modeling_data %>%
  mutate(resid = hi_eviction_model$residuals) %>%
  ggplot() +
  geom_sf(aes(fill = resid), color = "#8f98aa")+
  scale_fill_gradient(low = "#FCF5EE", high = "#C47572", 
                      guide = guide_legend(title = "Residual")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("Hierarchical Model:\nEviction Count Residuals \nin NYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 28, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))
ggarrange(evict_resid, hi_evict_resid, ncol=2, nrow=1)

Our residuals are randomly distributed across all neighborhoods indicating that our model’s bias is consistent across observations— this is really good! One could make the case that there are slight under predictions (darker red) of eviction counts in the Bronx (top) for both models. However, these differences are once again negligible. Altogether, we are selecting the hiearchical model of eviction counts given its relative computational efficiency.

Discussion & Extensions

We’ve confirmed that even after adjusting for income, rental prices, income inequality, and borough differences, neighborhoods in NYC with a substantial proportion of Black and Latinx residents are experiencing dramatic increased risks of eviction. Interestingly, when accounting for economic and demographic predictors, the relationships with transportation we expected to see were reversed. That is, increased transportation access was associated with more evictions. This could be related to the increased population sizes in areas with the best access to transportation (e.g. Manhattan) or it could be because the disparities that transportation inaccess reflects (e.g. racial and class-based tensions) have been accounted for.

Importantly, we found that there was statistically significant spatial clustering of both eviction counts and mean rental prices using Moran’s I from the spdep package.

library(spdep)
modeling_data <- modeling_data %>%
  st_as_sf()

col_sp <- as(modeling_data, "Spatial")
col_nb <- poly2nb(col_sp) # queen neighborhood
col_listw <- nb2listw(col_nb, style = "B") # listw version of the neighborhood
W <- nb2mat(col_nb, style = "B") # binary structure

evict_moran <- tidy(moran.mc(col_sp$eviction_count, listw = col_listw, nsim = 999, alternative = "greater")) %>%
  mutate(Term = "Eviction", .before=1)   

rent_moran <- tidy(moran.mc(col_sp$mean_rent, listw = col_listw, nsim = 999, alternative = "greater")) %>%
  mutate(Term = "Mean Rent", .before=1)   
  
rbind(rent_moran, evict_moran) %>%
  kable() %>%
  kable_styling()
Term statistic p.value parameter method alternative
Mean Rent 0.6560430 0.001 1000 Monte-Carlo simulation of Moran I greater
Eviction 0.5794612 0.001 1000 Monte-Carlo simulation of Moran I greater

Our results are then likely biased by the spatial relationships between neighborhoods. As such, we could extend this work to the spatial domain using methods from the CARBayes or INLA packages. Following work by Katie Jolly and Raven McKnight, we have outlined a spatial workflow for both the eviction and rental price models using CARBayes in the appendix. However, because spatial models were beyond the scope of this project and class, we’d like to emphasize that in a more detailed analysis the models we outline in the appendix would likely be adjusted in STAN or CARBayes to use different likelihoods, spatial effect priors (e.g. BYM, Intrinsic CAR, etc), and different priors for the predictors’ coefficients. Further, in a more detailed analysis, we would explicitly describe the mathematical construction of the models.

Although a lot of work went into designing models with causal blocking and confounding in mind, our models remain imperfect. Some major limitations involved the encoding of transportation deserts and the presence of unmeasured confounders.Because we may not have not accounted for all structural predictors in housing equity such as a neighborhood’s rent control policies nor have we adjusted for the reasons behind eviction, we cannot be entirely confident about our models’ conclusions.

Across our models, it becomes clear that many of the structural housing and demographic issues present in NYC need to be more rigorously addressed by both policy-makers and its citizens, regardless of these particular models’ performance. Health begins at home. And if NYC’s Black and Latinx residents are being crushed under the fist of inequity and consequently experiencing increased risks of eviction or tenuous rental prices, then it becomes a health imperative to critically and revolutionarily address NYC’s housing system.

Appendix

Spatial Rent

library(CARBayes)

equation <- mean_rent ~ total_pop  + 
  transportation_desert_4cat +  bus_count +
  school_count + store_count + borough + 
  gini_neighborhood + mean_income + black_perc + latinx_perc + asian_perc



tidy(moran.mc(col_sp$asian_perc, listw = col_listw, nsim = 999, alternative = "greater")) %>%
  mutate(Term = "Eviction", .before=1)   
## # A tibble: 1 x 6
##   Term     statistic p.value parameter method                        alternative
##   <chr>        <dbl>   <dbl>     <dbl> <chr>                         <chr>      
## 1 Eviction     0.674   0.001      1000 Monte-Carlo simulation of Mo… greater
carbayes_rent <- S.CARleroux(equation,
                 data = modeling_data, 
                 W = W, family = "gaussian", 
                 prior.mean.beta = c(1600, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),
                 prior.var.delta = c(20, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),
                 burnin = 20000, n.sample = 100000, rho=1,
                 thin = 10, verbose=F)

moran.mc(x = as.vector(carbayes_rent$residuals$response), listw = col_listw, nsim = 9999, alternative = "greater")
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  as.vector(carbayes_rent$residuals$response) 
## weights: col_listw  
## number of simulations + 1: 10000 
## 
## statistic = 0.022883, observed rank = 7285, p-value = 0.2715
## alternative hypothesis: greater
t(carbayes_rent$modelfit)%>%
  as.tibble() %>%
  kable(align = "c", caption = "CARBayes Spatial Eviction Model - DIC Metrics") %>% 
  kable_styling()
CARBayes Spatial Eviction Model - DIC Metrics
DIC p.d WAIC p.w LMPL loglikelihood
663.6964 32.63405 673.4911 37.31906 -343.5557 -299.2141
round(carbayes_rent$summary.results[, 1:7], 4) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  rename(term = rowname) %>%
  as.tibble() %>%
  mutate(`Median`= ifelse(term == "(Intercept)", exp(`Median`), (exp(`Median`)-1)*100), 
         `2.5%`= ifelse(term == "(Intercept)", exp(`2.5%`), (exp(`2.5%`)-1)*100), 
         `97.5%` = ifelse(term == "(Intercept)", exp(`97.5%`), (exp(`97.5%`)-1)*100))%>%
  filter(`2.5%` > 0 & `97.5%` > 0 | `2.5%`  < 0 & `97.5%` < 0) %>%
  mutate_if(is.numeric, ~round(., 4))  %>%
  kable(align = "c", caption = "CARBayes Spatial Rent Model - Summary") %>% 
  kable_styling()
CARBayes Spatial Rent Model - Summary
term Median 2.5% 97.5% n.sample % accept n.effective Geweke.diag
(Intercept) 5517.6096 25.2872 2106840.5213 8000 100 967.8 -3.7
mean_income 1.1870 1.0454 1.3186 8000 100 635.1 4.1
nu2 627.9047 240.8254 1361.4293 8000 100 86.4 2.1
tau2 25.6965 0.3707 1031.0144 8000 100 68.8 -2.2
rho 171.8282 171.8282 171.8282 NA NA NA NA
modeling_data %>%
  mutate(resid = carbayes_rent$residuals$response) %>%
  ggplot() +
  geom_sf(aes(fill = resid), color = "#8f98aa")+
  scale_fill_gradient(low = "#FCF5EE", high = "#30969C",
                      guide = guide_legend(title = "Residuals")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("CARBayes Besag:\nMean Rental Price Residuals in \nNYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

waic(rent_model)$estimates
##             Estimate        SE
## elpd_waic -343.32309 14.512615
## p_waic      19.71956  3.971957
## waic       686.64618 29.025230
waic(hi_rent_model)$estimates
##             Estimate       SE
## elpd_waic -380.03622 4.747577
## p_waic       7.28398 1.145918
## waic       760.07243 9.495153
carbayes_rent$modelfit 
##           DIC           p.d          WAIC           p.w          LMPL 
##     663.69636      32.63405     673.49113      37.31906    -343.55575 
## loglikelihood 
##    -299.21413

Spatial Eviction

library(CARBayes)

equation <- eviction_count ~ offset(log(total_pop)) + 
  transportation_desert_4cat + borough + unemployment_count+
  below_poverty_line_perc  + mean_income +  
  mean_rent + black_perc + latinx_perc + asian_perc


carbayes_evictions <- S.CARleroux(equation,
                 data = modeling_data, 
                 W = W, family = "poisson", 
                 burnin = 200000, n.sample = 1000000, rho=1,
                 thin = 10, verbose=F)



moran.mc(x = as.vector(carbayes_evictions$residuals$response), listw = col_listw, nsim = 9999, alternative = "greater")
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  as.vector(carbayes_evictions$residuals$response) 
## weights: col_listw  
## number of simulations + 1: 10000 
## 
## statistic = -0.20948, observed rank = 1, p-value = 0.9999
## alternative hypothesis: greater
t(carbayes_evictions$modelfit)%>%
  as.tibble() %>%
  kable(align = "c", caption = "CARBayes Spatial Eviction Model - DIC Metrics") %>% 
  kable_styling()
CARBayes Spatial Eviction Model - DIC Metrics
DIC p.d WAIC p.w LMPL loglikelihood
1641.226 169.5443 1602.378 94.0918 -911.7143 -651.0687
round(carbayes_evictions$summary.results[, 1:7], 4) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  rename(term = rowname) %>%
  as.tibble() %>%
  mutate(`Median`= ifelse(term == "(Intercept)", exp(`Median`), (exp(`Median`)-1)*100), 
         `2.5%`= ifelse(term == "(Intercept)", exp(`2.5%`), (exp(`2.5%`)-1)*100), 
         `97.5%` = ifelse(term == "(Intercept)", exp(`97.5%`), (exp(`97.5%`)-1)*100))%>%
  filter(`2.5%` > 0 & `97.5%` > 0 | `2.5%`  < 0 & `97.5%` < 0) %>%
  mutate_if(is.numeric, ~round(., 4))  %>%
  kable(align = "c", caption = "CARBayes Spatial Eviction Model - Summary") %>% 
  kable_styling()
CARBayes Spatial Eviction Model - Summary
term Median 2.5% 97.5% n.sample % accept n.effective Geweke.diag
(Intercept) 0.0034 0.0014 0.0075 80000 45.7 197.0 -2.4
mean_income -0.1299 -0.1898 -0.0800 80000 45.7 123.8 1.6
mean_rent 5.4746 1.9487 9.3409 80000 45.7 166.0 -1.3
black_perc 19.1365 14.1907 24.4209 80000 45.7 292.6 1.5
latinx_perc 11.6278 5.8550 17.9511 80000 45.7 231.7 2.8
tau2 45.7321 34.7431 61.6398 80000 100.0 6450.7 1.9
rho 171.8282 171.8282 171.8282 NA NA NA NA
modeling_data %>%
  mutate(resid = carbayes_evictions$residuals$response) %>%
  ggplot() +
  geom_sf(aes(fill = resid), color = "#8f98aa")+
  scale_fill_gradient(low = "#FCF5EE", high = "#C47572", 
                      guide = guide_legend(title = "Number of Evictions")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank()) +
  ggtitle("CARBayes Besag:\nEviction Count Residuals in \nNYC")+ 
    theme(panel.grid.major = element_line("transparent"),
          plot.title = element_text(family="DIN Condensed", size = 2* 30, face = "bold"),
          legend.title = element_text(size = 12), 
          legend.text = element_text(size = 12)) + 
    guides(shape = guide_legend(override.aes = list(size = 8)),
           color = guide_legend(override.aes = list(size = 8)))

waic(eviction_model)$estimates # best
##              Estimate        SE
## elpd_waic -1053.08406 13.371295
## p_waic       14.71553  2.109063
## waic       2106.16811 26.742590
waic(hi_eviction_model)$estimates 
##              Estimate        SE
## elpd_waic -1053.05331 13.339327
## p_waic       14.31652  2.084181
## waic       2106.10663 26.678654
carbayes_evictions$modelfit
##           DIC           p.d          WAIC           p.w          LMPL 
##     1641.2260      169.5443     1602.3777       94.0918     -911.7143 
## loglikelihood 
##     -651.0687